EAST Search History (1/29/04 2:13:41 PM, Page 1)
Workspace: C:\APPS\EAST\Workspaces\method and apparatus for reducing cache thrashing.wsp

L Number | Hits | Search Text | DBs | Time stamp
1 | (illegible) | 717/150.ccls. 717/159-161.ccls. | US-PGPUB; EPO; JPO; IBM TDB | (illegible)
4 | 6569 | prefetch pre adj fetch | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 11:50
7 | 27 | cache adj management adj instruction | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:50
9 | (illegible) | (detail section) adj loop | USPAT; US-PGPUB; EPO; JPO; IBM TDB | (illegible)
10 | 590 | loop near5 (unroll$3 un adj roll$3 restructur$3) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 12:00
11 | (illegible) | (loop near5 (unroll$3 un adj roll$3 restructur$3)) and (prefetch pre adj fetch) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 12:02
13 | 4 | ((loop near5 (unroll$3 un adj roll$3 restructur$3)) and (prefetch pre adj fetch)) and ((detail section) adj loop) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:37
15 | 10 | (prefetch pre adj fetch) and (cache adj management adj instruction) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:44
16 | 2 | ((prefetch pre adj fetch) and (cache adj management adj instruction)) and (loop near5 (unroll$3 un adj roll$3 restructur$3)) | US-PGPUB; EPO; JPO; IBM TDB | (illegible)
17 | 3872354 | code loop source object | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:48
18 | 9 | ((prefetch pre adj fetch) and (cache adj management adj instruction)) and (code loop source object) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:49
20 | 156 | ((prefetch pre adj fetch cache adj management) adj instruction) near3 (source code loop) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:52
21 | 32 | ((prefetch pre adj fetch cache adj management) adj instruction) near3 (loop) | USPAT; US-PGPUB; EPO; JPO; IBM TDB | 2004/01/29 13:52



United States Patent [19] 

Nishiyama et al. 



US005950007A
[11] Patent Number: 5,950,007
[45] Date of Patent: Sep. 7, 1999



[54] METHOD FOR COMPILING LOOPS
CONTAINING PREFETCH INSTRUCTIONS
THAT REPLACES ONE OR MORE ACTUAL
PREFETCHES WITH ONE VIRTUAL
PREFETCH PRIOR TO LOOP SCHEDULING
AND UNROLLING

[75] Inventors: Hiroyasu Nishiyama, Kawasaki;
Sumio Kikuchi, Machida, both of
Japan

[73] Assignee: Hitachi, Ltd., Tokyo, Japan 

[21] Appl. No.: 08/675,964 

[22] Filed: Jul. 5, 1996 

[30] Foreign Application Priority Data

Jul. 6, 1995 [JP] Japan 7-170674

[51] Int. Cl.6 G06F 9/45

[52] U.S. Cl. 395/707; 395/705

[58] Field of Search 395/705, 706,
395/707, 709, 710

[56] References Cited 

U.S. PATENT DOCUMENTS 

5,303,357 4/1994 Inoue et al 395/709 

5,367,651 11/1994 Smith et al 395/709 

5,491,823 2/1996 Ruttenberg 395/709 

5,557,761 9/1996 Chan et al 395/709 

5,664,193 9/1997 Tirumalai 395/705 

5,704,053 12/1997 Santhanam 395/383 

5,752,037 5/1998 Gornish et al 395/709 

5,761,515 6/1998 Barton, III et al 395/709 

5,794,029 8/1998 Babaian et al 395/588 

5,797,013 8/1998 Mahadevan et al 395/709 

5,809,308 9/1998 Tirumalai 395/709 

5,819,088 10/1998 Reinders 395/672 



5,835,776 11/1998 Tirumalai et al 395/709 

OTHER PUBLICATIONS 

S. Ramakrishnan, "Software Pipelining in PA-RISC Compilers,"
Hewlett-Packard Journal, Jun. 1992, pp. 39-45.

T. Mowry et al, "Design and Evaluation of a Compiler
Algorithm for Prefetching," Proceedings of the Fifth
International Conference on Architectural Support for
Programming Language and Operating Systems, 1992, pp. 62-73.

D. Callahan et al, "Software Prefetching," 1991 ACM
0-89791-380, pp. 40-52.

G. Kurpanek et al, "PA7200: A PA-RISC Processor with
Integrated High Performance MP Bus Interface,"
Hewlett-Packard Company, 1994 IEEE, pp. 375-382.

Primary Examiner: Zarni Maung

Assistant Examiner: Andrew Caldwell

Attorney, Agent, or Firm: Fay, Sharpe, Beall, Fagan,
Minnich & McKee

[57] ABSTRACT 

Prefetch instructions having a function to move data to a 
cache memory from main memory are scheduled simulta- 
neously with execution of other instructions. The prefetch 
instructions are scheduled by replacing, with the original 
prefetch instructions, the virtual prefetch instructions 
obtained by unrolling a kernel section of the schedule 
constituted by generating a dependency graph having depen- 
dent relationships between the prefetch instruction and the 
memory reference instruction, and then applying the soft- 
ware pipelining thereto, or by further unrolling the kernel 
section of the constituted schedule to delete the redundant 
prefetch instructions, or further by applying the software 
pipelining to the dependency graph which is formed by 
combining a plurality of prefetch instructions and replacing 
the prefetch instructions with virtual prefetch instructions. 

6 Claims, 16 Drawing Sheets 







01/29/2004, EAST version: 1.4.1 



U.S. Patent    Sep. 7, 1999    Sheet 1 of 16    5,950,007

[FIG. 1: diagram of an instruction scheduler. Legible labels:
INTERMEDIATE LANGUAGE, DEPENDENCY GRAPH, INSTRUCTION SCHEDULE;
reference numerals 102, 103, 104, 105, 107, 108, 117; legend:
FLOW OF CONTROL, FLOW OF DATA.]



Sheet 2 of 16

[FIG. 2: computer system diagram. Legible label: SOURCE CODE 203.]

[FIG. 3: computer diagram. Legible reference numeral: 302.]



Sheet 3 of 16

[FIG. 4: diagram of the instruction scheduler for method 1. Legible
labels: INTERMEDIATE LANGUAGE, INSTRUCTION SCHEDULE, PREFETCH
INSTRUCTION GENERATOR, DEPENDENCY GRAPH GENERATOR, SOFTWARE
PIPELINING SECTION, DEPENDENCY GRAPH; reference numerals 101, 102,
103, 104, 105, 109, 111, 112; legend: FLOW OF CONTROL, FLOW OF DATA.]

[FIG. 8: flowchart of the prefetch instruction deleting section.
Legible steps: START; 801: B ← size of cache block, D ← size of
reference object element; 802: unprocessed prefetch instruction
remains? (YES/NO); 803: PR ← copy of the same unprocessed
instruction; 804a, 804b: decision text not legible (YES); delete
prefetch instruction PR; END.]



Sheet 4 of 16

[FIG. 5: diagram of the instruction scheduler for method 2. Legible
labels: INTERMEDIATE LANGUAGE, PREFETCH INSTRUCTION GENERATOR,
DEPENDENCY GRAPH GENERATOR, SOFTWARE PIPELINING SECTION, LOOP
UNROLLING SECTION, PREFETCH INSTRUCTION DELETING SECTION,
DEPENDENCY GRAPH, INSTRUCTION SCHEDULE; reference numerals 101,
102, 103, 104, 105, 106, 109, 111, 112, 113, 114; legend: FLOW OF
CONTROL, FLOW OF DATA.]



Sheet 5 of 16

[FIG. 6: diagram of the instruction scheduler for method 3. Legible
labels: INTERMEDIATE LANGUAGE, INSTRUCTION SCHEDULE, PREFETCH
INSTRUCTION GENERATOR, PREFETCH INSTRUCTION REPLACING SECTION,
DEPENDENCY GRAPH GENERATOR, SOFTWARE PIPELINING SECTION, LOOP
UNROLLING SECTION, PREFETCH INSTRUCTION RECOVERY SECTION, PREFETCH
ADDRESS ADJUSTING SECTION, DEPENDENCY GRAPH; reference numerals
101, 102, 103, 104, 105, 106, 107, 108, 111, 112, 113, 115;
legend: FLOW OF CONTROL, FLOW OF DATA.]



Sheet 6 of 16

[FIG. 7: flowchart of the prefetch instruction generator. Legible
steps: MI ← unprocessed memory reference instructions; 703: prefetch
instruction for making reference to the address which is also
referred to by the memory reference instruction MI is created; END.]



Sheet 7 of 16

[FIG. 9: flowchart of the prefetch instruction replacing section.
Legible steps: START; 901: B ← size of cache block, D ← size of
reference object element; VPF ← newly generated virtual prefetch
instruction; PF ← prefetch instruction is selected; virtual prefetch
instruction VPF is inserted into the intermediate language; selected
prefetch instruction is set corresponding to the virtual prefetch
instruction VPF; prefetch instruction PF is deleted from the
intermediate language; n ← 0; decision text not legible (YES/NO);
END. Reference numerals 901, 904, 905, 906, 907, 910, 911.]



Sheet 8 of 16

[FIG. 10: flowchart of the prefetch instruction recovery section.
Legible steps: START; 1001: B ← size of cache block, D ← size of
reference object element; VPi ← copy of the same virtual prefetch
instruction (0 ≤ i < m); PFj ← original prefetch instruction
corresponding to VPi (0 ≤ j < n); 1006: VPi is replaced with PFj for
each VPi (j = i MOD (B/D)); 1007: VPi is replaced with PFj when
0 ≤ j < n for each VPi, and VPi is deleted when n ≤ j
(j = i MOD (B/D)); END. Reference numerals 1001, 1004, 1005, 1006,
1007.]



Sheet 9 of 16

[FIG. 11: flowchart of the prefetch address adjusting section.
Legible steps: START; 1101: B ← size of cache block, D ← size of
reference object element, L ← execution cycle per single loop,
M ← cycle required for data transfer to cache from main memory,
a ← minimum integer of M/L + (B/D) or more; 1102: unprocessed
prefetch instructions remain? (YES); 1103: unprocessed prefetch
instruction is selected; 1104: address referred to by the prefetch
instruction PF is set as the address referred to by the iteration
after the a-times; END.]
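The quantity a in FIG. 11 reads as the smallest integer not less than M/L + (B/D). A minimal sketch of that computation follows, assuming this reading of the flowchart label is correct. The function names and the numeric values in the usage note are hypothetical and are not taken from the patent.

```python
import math
from fractions import Fraction


def prefetch_lead_iterations(B, D, L, M):
    """Smallest integer a with a >= M/L + B/D (reading of FIG. 11, step 1101).

    B -- size of a cache block
    D -- size of one referenced element
    L -- execution cycles per loop iteration
    M -- cycles needed to move data from main memory to the cache
    """
    if min(B, D, L, M) <= 0:
        raise ValueError("all parameters must be positive")
    # Exact rational arithmetic avoids floating-point rounding at the ceiling.
    return math.ceil(Fraction(M, L) + Fraction(B, D))


def adjusted_prefetch_index(i, a):
    """Index referenced a iterations after iteration i (step 1104)."""
    return i + a
```

With the illustrative values B=32, D=8, L=5, M=50, the lead is 14 iterations, so a prefetch issued in iteration i would target element i + 14.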



[FIG. 12: example FORTRAN source program:]

      DO 10 I = 0, N
         X(I) = A*X(I) + Y(I)
   10 CONTINUE



Sheet 10 of 16

[FIG. 13: intermediate language for the loop of FIG. 12; reference
numerals 1301, 1302, 1303:]

    t0 = X[i]
    t1 = A*t0
    t2 = Y[i]
    t3 = t1+t2
    X[i] = t3

[FIG. 14: intermediate language of FIG. 13 with prefetch
instructions added; reference numerals 1401, 1402:]

    fetch X[i]
    fetch Y[i]
    t0 = X[i]
    t1 = A*t0
    t2 = Y[i]
    t3 = t1+t2
    X[i] = t3



Sheet 11 of 16

[FIG. 16: software-pipelined schedule according to method 1. Column
headings: INTEGER UNIT, FLOATING POINT UNIT; reference numerals
1601 to 1605. Legible entries in reading order (slot and column
positions are not recoverable):]

    t0 = X[i];  t2 = Y[i];  t1 = A*t0
    L: fetch X[i+11];  fetch Y[i+11];  t3 = t1+t2;  t0 = X[i+1];  t2 = Y[i+1];  X[i] = t3;  t1 = A*t0
    if (i++ < N) goto L
    t3 = t1+t2;  X[i] = t3



Sheet 12 of 16

[FIG. 17: unrolled schedule obtained by method 2. Column headings:
INTEGER UNIT, FLOATING POINT UNIT; reference numerals 1701, 1702,
1703. Legible entries in reading order (slot and column positions
are not recoverable):]

    t0 = X[i];  t2 = Y[i];  t1 = A*t0
    L: fetch X[i+11];  fetch Y[i+11];  t3 = t1+t2;  t0 = X[i+1];  t2 = Y[i+1];  X[i] = t3;  t1 = A*t0
    fetch X[i+12];  fetch Y[i+12];  t3 = t1+t2;  t0 = X[i+2];  t2 = Y[i+2];  X[i+1] = t3;  t1 = A*t0
    fetch X[i+13];  fetch Y[i+13];  t3 = t1+t2;  t0 = X[i+3];  t2 = Y[i+3];  X[i+2] = t3;  t1 = A*t0
    fetch X[i+14];  fetch Y[i+14];  t3 = t1+t2;  t0 = X[i+4];  t2 = Y[i+4];  X[i+3] = t3;  t1 = A*t0
    if (i++ < N) goto L
    t3 = t1+t2;  X[i] = t3



Sheet 13 of 16

[FIG. 18: schedule of FIG. 17 with the redundant prefetch
instructions deleted according to method 2. Column headings:
INTEGER UNIT, FLOATING POINT UNIT; reference numerals 1802, 1803.
Legible entries in reading order (slot and column positions are not
recoverable):]

    t0 = X[i];  t2 = Y[i];  t1 = A*t0
    L: fetch X[i+11];  t3 = t1+t2;  t0 = X[i+1];  t2 = Y[i+1];  X[i] = t3;  t1 = A*t0
    t3 = t1+t2;  t0 = X[i+2];  t2 = Y[i+2];  X[i+1] = t3;  t1 = A*t0
    fetch Y[i+13];  t3 = t1+t2;  t0 = X[i+3];  t2 = Y[i+3];  X[i+2] = t3;  t1 = A*t0
    t3 = t1+t2;  t0 = X[i+4];  t2 = Y[i+4];  X[i+3] = t3;  t1 = A*t0
    if ((i += 4) < N) goto L
    t3 = t1+t2;  X[i] = t3
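The step from FIG. 17 to FIG. 18 removes prefetches that would request a cache block already requested by an earlier copy of the kernel. The following is a minimal sketch of that idea, not the flowchart of FIG. 8 (whose decision boxes are not legible). It assumes each unrolled copy is a list of instruction strings, that prefetches are written as `fetch NAME[...]`, and that one prefetch per array is kept in every `n_per_block` consecutive copies. The function name and this keep-the-first rule are assumptions. In FIG. 18 the surviving `fetch Y` sits in a later copy than the surviving `fetch X`, so the patent's own placement differs from this simplification.

```python
def delete_redundant_prefetches(unrolled_copies, n_per_block):
    """Keep one prefetch per array in every n_per_block unrolled copies."""
    if n_per_block <= 0:
        raise ValueError("n_per_block must be positive")
    seen = {}      # array name -> number of prefetches met so far
    result = []
    for copy in unrolled_copies:
        kept = []
        for instr in copy:
            if instr.startswith("fetch "):
                array = instr.split()[1].split("[")[0]
                count = seen.get(array, 0)
                seen[array] = count + 1
                if count % n_per_block != 0:
                    continue   # treated as redundant under the stated assumption
            kept.append(instr)
        result.append(kept)
    return result


# Four unrolled copies, each with a prefetch for X and for Y.
copies = [
    [f"fetch X[i+{11 + k}]", f"fetch Y[i+{11 + k}]", "t3 = t1+t2"]
    for k in range(4)
]
pruned = delete_redundant_prefetches(copies, n_per_block=4)
```

In this usage, only the first copy keeps its two prefetches, and the other three copies keep only the arithmetic instruction.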



Sheet 14 of 16

[FIG. 19: dependency graph generated according to method 3;
reference numerals 1901, 1902, 1903. Legible node labels:
fetch X[i], fetch Y[i], t0 = X[i], t1 = A*t0, t2 = Y[i],
t3 = t1+t2, X[i] = t3.]

[FIG. 20: software-pipelined schedule according to method 3. Column
headings: INTEGER UNIT, FLOATING POINT UNIT; reference numerals
2001 to 2004. Legible entries in reading order (slot and column
positions are not recoverable):]

    t0 = X[i];  t2 = Y[i];  t1 = A*t0
    L: t0 = X[i+1];  t3 = t1+t2;  t2 = Y[i+1];  PREFETCH;  t1 = A*t0;  X[i] = t3
    if (i++ < N) goto L
    t3 = t1+t2;  X[i] = t3



Sheet 15 of 16

[FIG. 21: unrolled schedule according to method 3. Column headings:
INTEGER UNIT, FLOATING POINT UNIT; reference numerals 2101 to 2107.
Legible entries in reading order (slot and column positions are not
recoverable):]

    t0 = X[i];  t2 = Y[i];  t1 = A*t0
    L: t0 = X[i+1];  t3 = t1+t2;  t2 = Y[i+1];  PREFETCH;  t1 = A*t0;  X[i] = t3
    t0 = X[i+2];  t3 = t1+t2;  t2 = Y[i+2];  PREFETCH;  t1 = A*t0;  X[i+1] = t3
    t0 = X[i+3];  t3 = t1+t2;  t2 = Y[i+3];  PREFETCH;  t1 = A*t0;  X[i+2] = t3
    t0 = X[i+4];  t3 = t1+t2;  t2 = Y[i+4];  PREFETCH;  t1 = A*t0;  X[i+3] = t3
    if ((i += 4) < N) goto L
    t3 = t1+t2;  X[i] = t3



Sheet 16 of 16

[FIG. 22: schedule obtained following replacement of the prefetch
instructions in method 3. Column headings: INTEGER UNIT, FLOATING
POINT UNIT; reference numerals 2201 to 2207. Legible entries in
reading order (slot and column positions are not recoverable; the
second and fourth unrolled copies show no prefetch entry):]

    t0 = X[i];  t2 = Y[i];  t1 = A*t0
    L: t0 = X[i+1];  t3 = t1+t2;  t2 = Y[i+1];  fetch X[i+15];  t1 = A*t0;  X[i] = t3
    t0 = X[i+2];  t3 = t1+t2;  t2 = Y[i+2];  t1 = A*t0;  X[i+1] = t3
    t0 = X[i+3];  t3 = t1+t2;  t2 = Y[i+3];  fetch Y[i+17];  t1 = A*t0;  X[i+2] = t3
    t0 = X[i+4];  t3 = t1+t2;  t2 = Y[i+4];  t1 = A*t0;  X[i+3] = t3
    if ((i += 4) < N) goto L
    t3 = t1+t2;  X[i] = t3




METHOD FOR COMPILING LOOPS 
CONTAINING PREFETCH INSTRUCTIONS 
THAT REPLACES ONE OR MORE ACTUAL 

PREFETCHES WITH ONE VIRTUAL 
PREFETCH PRIOR TO LOOP SCHEDULING 
AND UNROLLING 

FIELD OF THE INVENTION 

The present invention relates to a data prefetch method 
and more specifically to a compile method which shortens 
the execution time of a program by prefetching data through 
scheduling of the prefetch instruction for a loop. 

BACKGROUND OF THE INVENTION 

The execution time of a program depends significantly on
the waiting time generated by the dependent relationship
between instructions and the waiting time generated by
memory references.

The waiting time generated by the dependent relationship 
between instructions within a loop can be considerably 
reduced by using a software pipelining scheduling method. 
Software pipelining, as described, for example, in "Software
Pipelining in PA-RISC Compilers" by S. Ramakrishnan,
Hewlett-Packard Journal, pp. 39-45, 1992, reduces the
waiting time generated by the dependent relationship 
between instructions and enhances the degree of parallelism 
in execution of instructions by overlapped execution of 
different iterations of the loop. The loop to which the 
software pipelining is applied is characterized by executing 
the code for initialization called a prologue before starting 
execution of the loop, executing the loop body by repeating 
code called a kernel, terminating the process by executing 
code called an epilogue when execution of the loop is 
completed, and starting execution of the subsequent iteration 
without waiting for the completion of the preceding itera- 
tion. 
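As an illustration only, the prologue, kernel and epilogue structure described above can be mimicked in Python for a loop of the form X[i] = A*X[i] + Y[i]. The function names and the split into two stages are assumptions made for this sketch; a real compiler would perform the transformation on machine instructions, not on source code.

```python
def plain_loop(a, x, y):
    """Reference version: each iteration runs from start to finish."""
    out = list(x)
    for i in range(len(out)):
        out[i] = a * out[i] + y[i]
    return out


def pipelined_loop(a, x, y):
    """Same computation, restructured as prologue / kernel / epilogue.

    Stage 1 (loads and multiply) of iteration i + 1 is overlapped with
    stage 2 (add and store) of iteration i inside the kernel.
    """
    out = list(x)
    n = len(out)
    if n == 0:
        return out

    # Prologue: start iteration 0 (stage 1 only).
    t1 = a * out[0]
    t2 = y[0]

    # Kernel: finish iteration i while starting iteration i + 1.
    for i in range(n - 1):
        t3 = t1 + t2          # stage 2 of iteration i
        t1 = a * out[i + 1]   # stage 1 of iteration i + 1
        t2 = y[i + 1]
        out[i] = t3

    # Epilogue: finish the last iteration.
    out[n - 1] = t1 + t2
    return out
```

Both functions return the same list for the same inputs; only the order in which the work of neighbouring iterations is interleaved differs.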

Compared with the waiting time generated by the dependent
relationship between instructions, the waiting time
associated with memory references is rather difficult to
reduce with a software method alone. Therefore, in many
computer systems, a high speed and small capacity memory
called a cache memory is provided between the main memory
and a processor to reduce the waiting time generated by a
memory reference, so that recently referenced data can be
referenced at high speed in the cache memory. However, even
when a cache memory is used, waiting time is inevitably
generated if the data is not reused and a cache miss occurs.

Therefore, as described, for example, in "Design and
Evaluation of a Compiler Algorithm for Prefetching" by T.
C. Mowry, et al., Proceedings of the 5th International
Conference on Architectural Support for Programming
Language and Operating Systems, pp. 62-73, 1992,
an attempt is made to reduce the waiting time generated by
the memory references by utilizing an instruction for
prefetching data from the main memory to the cache
memory.

SUMMARY OF THE INVENTION 

In the prior art described above, software pipelining has
been applied as a method of scheduling a prefetch
instruction, in such a manner that the prefetch instruction
is issued a number of iterations ahead, that number being
the smallest integer not less than the value obtained by
dividing the delay time of the prefetch instruction by the
shortest path length of the loop body. However, the details
for realizing such an application have not yet been
described.
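The prefetch distance used in that prior art is a ceiling division. The following sketch assumes the delay and the path length are both expressed in whole cycles; the function and parameter names are hypothetical.

```python
def prefetch_distance(prefetch_delay_cycles, shortest_path_cycles):
    """Smallest number of iterations whose execution covers the prefetch delay."""
    if prefetch_delay_cycles < 0 or shortest_path_cycles <= 0:
        raise ValueError("delay must be >= 0 and path length must be > 0")
    # Integer ceiling division: avoids floating-point rounding.
    return -(-prefetch_delay_cycles // shortest_path_cycles)
```

For example, with an illustrative delay of 50 cycles and a shortest path of 6 cycles, the prefetch would be issued 9 iterations ahead, since 8 iterations cover only 48 cycles.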

It is therefore an object of the present invention to provide
an effective instruction scheduling method which can reduce
the waiting time generated by a memory reference and the
waiting time generated by the dependent relationship
between instructions (inter-instruction dependency) while a
program is executed in the loop including the prefetch
instruction.

To achieve this object, in the method of the present
invention, the scheduling of the prefetch instruction for
the loop in a program is executed in accordance with one
of three types of methods at the time of compiling the
program.

The value of the data is not altered during the prefetch of
data to the cache. Therefore, in terms of the ordinary
relationship between the definition and the use of data,
a dependency relationship does not exist between the
prefetch of data to the cache and a reference to the memory
with a load instruction or store instruction. However, the
memory reference instruction must be issued after completion
of the data transfer to the cache by the prefetch
instruction. If a tacit dependent relationship is assumed to
exist between the prefetch instruction and the memory
reference instruction because of this limitation, the
existing scheduling system can be applied directly to hide
the waiting time due to the reference to memory, which is
convenient and advantageous. Therefore, in method 1,
the scheduling is performed by providing the dependent
relationship between the prefetch instruction and the
memory reference instruction.
Method 1 

(1) A prefetch instruction is issued for each memory
reference instruction which is assumed to generate a cache
miss.

(2) A dependency graph having edges between the
prefetch instructions generated in item (1) explained above
and the corresponding memory reference instructions is
generated. In this case, the delay between the prefetch
instruction and the memory reference instruction is set to a
value larger than the number of cycles required for data
transfer to the cache by the prefetch instruction, so that
the memory reference instruction is issued after the number
of cycles required for data transfer to the cache by the
prefetch instruction.

(3) An instruction schedule is obtained by applying the
software pipelining to the dependency graph generated in
item (2) explained above. As explained above, software
pipelining is a method that reduces the waiting time
generated by the inter-instruction dependency by the
overlapping execution of different iterations of the loop,
so that a sufficient interval can be provided between
the prefetch instruction and the corresponding memory
reference instruction.
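Steps (1) and (2) of method 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: instructions are plain strings, the dependency graph is a list of (source, destination, delay) tuples, and the delay is chosen as the transfer time plus one cycle. The function name, the data layout and the value of 50 cycles are all hypothetical. Step (3), the software pipelining itself, is not shown.

```python
def build_method1_graph(loop_body, miss_candidates, transfer_cycles):
    """Return (nodes, edges) for a loop body with prefetches added.

    loop_body       -- list of instruction strings in program order
    miss_candidates -- dict: memory reference instruction -> address text
    transfer_cycles -- cycles needed to move a block into the cache
    """
    prefetches = []
    edges = []  # (source, destination, delay in cycles)
    for instr, address in miss_candidates.items():
        prefetch = "fetch " + address      # step (1): one prefetch per reference
        prefetches.append(prefetch)
        # Step (2): the edge delay is strictly larger than the transfer time.
        edges.append((prefetch, instr, transfer_cycles + 1))
    return prefetches + list(loop_body), edges


# Loop body written in the style of the intermediate language of FIG. 13.
body = ["t0 = X[i]", "t1 = A*t0", "t2 = Y[i]", "t3 = t1+t2", "X[i] = t3"]
nodes, edges = build_method1_graph(
    body,
    {"t0 = X[i]": "X[i]", "t2 = Y[i]": "Y[i]"},
    transfer_cycles=50,
)
```

A scheduler that honours these edge delays cannot place a load closer than 51 cycles after its prefetch, which is what lets an unmodified software pipeliner separate the two.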

While data transfer to the cache from the main memory is
generally carried out in units of 32 bytes or 128 bytes, etc.,
the reference to the array in the loop is often carried out in
smaller units, such as 4 bytes or 8 bytes, etc. Therefore,
when a memory reference is continuously carried out for the
array, etc. in the loop, the data referenced over a plurality
of iterations can often be moved to the cache from the main
memory by a single execution of a prefetch instruction. That
is, if one execution of a prefetch instruction moves to the
cache the data referenced by N iterations, it is enough that
the prefetch instruction is issued once for every N
iterations.

For the schedule generated in method 1, since the prefetch
instruction is generated once for every iteration, many
redundant prefetch instructions are generated. Therefore, the
prefetch instructions are scheduled by unrolling the loop so
that the redundant prefetch instructions are not generated
frequently.

In method 2, the kernel section of the loop including
the software pipelined prefetch instructions generated by the
processings of items (1) to (3) is unrolled to avoid issuing
useless prefetch instructions by eliminating the redundant
prefetch instructions.

Method 2

(4) Since it is sufficient to issue a prefetch instruction
once for every N iterations when the number of data
prefetched by one prefetch instruction is N, the kernel
section of the software pipelined schedule (item (3)
explained above) is unrolled until the number of unrollings
becomes a multiple of N.

(5) In the unrolled code of item (4) explained above,
the kernel section is unrolled a number of times equal to a
multiple of N, so that one iteration of the unrolled kernel
section executes a multiple of N iterations of the loop.
Therefore, issuing useless prefetch instructions can be
eliminated by deleting the redundant prefetch instructions
from the unrolled code so that the prefetch instruction is
issued only once for every N iterations.

In method 2, since the redundant prefetch instructions are
removed after the software pipelining is applied to the
prefetch instruction, deleting the prefetch instructions
makes the interval between the instructions shorter than
that expected when the software pipelining was applied, and
thereby the waiting time caused by the inter-instruction
dependency may become visible.

In method 3, a plurality of prefetch instructions are
replaced with one virtual prefetch instruction to generate a
dependency graph including such a virtual prefetch
instruction, considering the unrolling of the kernel section
after application of software pipelining to the loop.
However, unlike methods 1 and 2, the dependent relationship
may not be provided in method 3 between the virtual prefetch
instruction and the corresponding memory reference
instruction.

Next, software pipelining is applied to the dependency
graph to obtain a software pipelined schedule, and the loop
is unrolled, as required, so that the number of times of
unrolling of the kernel section becomes equal to a multiple
of the number of data which can be prefetched by one
prefetch instruction. The unrolled virtual prefetch
instruction is replaced with the initial prefetch
instruction, and the address referred to by the prefetch
instruction is adjusted so that the prefetch instructions
are issued in an iteration sufficiently preceding the
corresponding memory reference instruction. Thereby, the
waiting time due to the dependent relationship between
instructions, which is exposed by deleting the instructions
in method 2, can be reduced.

Method 3

In accordance with method 3 of the present invention, the
following steps are executed.

(1) The prefetch instructions are generated respectively
for the memory reference instructions which are assumed to
generate a cache miss.

(2) The prefetch instructions generated in item (1)
explained above are grouped into a plurality of groups and
are then replaced with virtual prefetch instructions.

(3) A dependency graph composed of the instructions of the
original loop body and the virtual prefetch instructions
generated in item (2) above is generated and the software
pipelining is applied thereto. In the case of generating the
dependency graph, it is no longer necessary to think about
the dependent relationship between the virtual prefetch
instruction and the corresponding memory reference
instruction.

(4) The loop is unrolled as required so that the number of
times of unrolling of the kernel section formed in item (3)
explained above becomes equal to a multiple of the number of
iterations for prefetching data with one prefetch
instruction. In the code after the unrolling, the virtual
prefetch instruction indicates instruction slots, in any
one of which the original prefetch instruction is inserted.

(5) The virtual prefetch instruction scheduled in the
unrolled code of item (4) above is replaced with the original
prefetch instruction. This replacement is necessary to issue
the same prefetch instruction for every multiple of the
number of iterations for prefetching the data with one
prefetch instruction. Thereby, issuing redundant prefetch
instructions can be controlled.

(6) The address referenced by the prefetch instruction
substituted in item (5) explained above is set to the
address of data that is referenced after the data transfer
by the prefetch instruction has completed.

According to the method of the present invention, if
reference to memory is not continuously executed, an
interval between the prefetch instruction and a memory
reference instruction can be maintained sufficiently long by
method 1 in view of applying the software pipelining.
Moreover, when the memory reference is executed
continuously, issuing redundant prefetch instructions can be
controlled to effectively perform the scheduling by removing
the instruction after application of the software pipelining
by means of method 2, or by applying the software pipelining
through replacement of a plurality of prefetch instructions
with the virtual prefetch instruction by means of method 3
and then recovering the virtual prefetch instruction into
the original prefetch instruction. Thereby, the object of
the present invention can be achieved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an instruction scheduler for
scheduling the prefetch instructions.

FIG. 2 is an example of a computer system in which the
present invention is employed.

FIG. 3 is an example of a computer in which the present
invention is employed.

FIG. 4 is a diagram of an instruction scheduler that
executes scheduling according to method 1.

FIG. 5 is a diagram of an instruction scheduler that
executes scheduling according to method 2.

FIG. 6 is a diagram of an instruction scheduler that
executes scheduling according to method 3.

FIG. 7 is a flowchart of a prefetch instruction generator.

FIG. 8 is a flowchart of a prefetch instruction deleting
section.

FIG. 9 is a flowchart of a prefetch instruction replacing
section.

FIG. 10 is a flowchart of a prefetch instruction recovery
section.

FIG. 11 is a flowchart of a prefetch address adjusting
section.

FIG. 12 is an example of a FORTRAN source program.

FIG. 13 is an example of the intermediate language
generated in compiling the source program of FIG. 12.




FIG. 14 is an example of the intermediate language of 
FIG. 13 which includes the prefetch instructions. 

FIG. 15 is an example of a dependency graph for the 
intermediate language of FIG. 14 which includes the 
prefetch instructions according to method 1. 

FIG. 16 is an example of the software-pipelined schedule 
obtained by applying software pipelining to the dependency 
graph of FIG. 15 according to method 1. 

FIG. 17 is an example of the unrolled schedule obtained 
by method 2. 

FIG. 18 is an example of the schedule of FIG. 17 having 
the redundant prefetch instructions deleted according to 
method 2. 

FIG. 19 is an example of a dependency graph generated 
according to method 3. 

FIG. 20 is an example of the software-pipelined schedule
obtained by applying software pipelining to the dependency
graph of FIG. 19 according to method 3.

FIG. 21 is an example of the unrolled schedule according
to method 3.

FIG. 22 is an example of the schedule obtained following
replacement of the prefetch instructions in method 3.

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

A preferred embodiment of the present invention will be 
explained with reference to the accompanying drawings. 

FIG. 2 illustrates an example of a computer to which the
method of the present invention is applied. In this example,
the compiler is, in a preferred embodiment, constituted
by a combination of software stored on a storage medium
202, such as a hard disk or other storage device, and
hardware, such as a computer (CPU) 201, that executes the
software to perform the function of the compiler. The
compiler operates on the CPU 201, reads the source code 203
from the external memory 202, converts it into the object
code 204, and then stores the object code into the external
memory 202.

FIG. 3 illustrates an example of a computer in which the
data prefetch method of the present invention is employed.
When an ordinary memory reference instruction is executed
by the CPU 301, it is first checked whether the
reference object data is in the cache 302. When such data
exists in the cache 302, reference is made to such data. If
the reference object data is not in the cache 302, reference
is made to the relevant data in the main memory 303 and the
cache block to which the relevant data belongs is placed in
the cache 302. Reference to the cache is made at a high
speed in comparison with reference to the main memory,
so when the reference object data is found in the cache,
the waiting time generated by the memory reference is
reduced.

The prefetch instruction is used for moving the cache 
block that the data of the reference object belongs to into the 
cache 302 from the main memory 303 simultaneously with 
the execution of the other instructions. Since the other 
instructions can be executed during the transfer of the data 
to the cache 302 from the main memory 303 by issuing the 
prefetch instruction beforehand by a number of cycles 
sufficient for movement of the cache block to the cache 302 
from the main memory 303, the waiting time for making 
reference to the relevant data can be eliminated. 
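The lookup and prefetch behaviour described for FIG. 3 can be imitated with a toy model. This is a sketch only: it assumes an unbounded cache with no eviction, treats a prefetch as completing instantly, and counts misses instead of cycles. The class, method and parameter names are hypothetical.

```python
class ToyCache:
    """Tracks which cache blocks are present; no eviction is modelled."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = set()

    def _block(self, address):
        return address // self.block_size

    def reference(self, address):
        """Return True on a hit; on a miss, load the block and return False."""
        block = self._block(address)
        hit = block in self.blocks
        self.blocks.add(block)
        return hit

    def prefetch(self, address):
        """Bring the block holding `address` into the cache ahead of use."""
        self.blocks.add(self._block(address))


def count_misses(addresses, block_size, prefetch_ahead=None):
    """Count misses for a reference stream, optionally prefetching ahead."""
    cache = ToyCache(block_size)
    misses = 0
    for address in addresses:
        if prefetch_ahead is not None:
            cache.prefetch(address + prefetch_ahead)
        if not cache.reference(address):
            misses += 1
    return misses
```

For 8-byte elements read sequentially from a 256-byte range with 32-byte blocks, `count_misses(range(0, 256, 8), 32)` reports one miss per block, while passing `prefetch_ahead=32` leaves only the miss on the very first block.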

FIG. 1 illustrates a diagram providing an example of the
present invention. In FIG. 1, a scheduling processor 101
inputs an intermediate language 102 for a loop body and
outputs an instruction schedule 103 including the prefetch
instructions and having a reduced amount of delay caused by
inter-instruction dependency and a reduced amount of waiting
time resulting from memory reference. Processings 117
and 118 (software program processings) are characteristic
processings of the present invention. In the processing 117,
the generation of prefetch instructions and the preprocessing
of the scheduling are executed, while in the processing 118,
the removal of redundant prefetch instructions and
postprocessing, such as adjustment of prefetch addresses,
are performed.

First, an embodiment for scheduling the loop with method 1
will be described. FIG. 4 is a diagram of an instruction
scheduler for scheduling the prefetch instruction with
method 1. In method 1, the prefetch instruction generator
109 inputs the intermediate language 102 of the loop body
and, for those memory reference instructions in it that are
assumed to have a high possibility of generating a cache
miss, generates prefetch instructions. The result is the
intermediate language 104 having the added prefetch
instructions.

Here, the possibility that a certain memory reference
instruction generates a cache miss can be estimated
according to the known prior art described, for example, in
"Design and Evaluation of a Compiler Algorithm for
Prefetching" by T. C. Mowry et al., Proceedings of the 5th
International Conference on Architectural Support for
Programming Languages and Operating Systems, pp. 62-73,
1992, along with a trace of the program execution. The
addresses prefetched by the generated prefetch instructions
are assumed to be those of the corresponding memory
reference instructions.

Namely, if a load instruction, LOAD X[i], in the loop is
assumed to be likely to generate a cache miss, an
instruction for prefetching the same element, FETCH X[i], is
generated and added to the intermediate language.

Next, the dependency graph generator 111 inputs the
intermediate language 104 including the prefetch instruction
and generates a dependency graph 105. In this graph, an edge
is provided between each prefetch instruction and the
corresponding memory reference instruction, indicating that
the delay required between the two is longer than the time
required for transferring the cache block from the main
memory to the cache. Next, the software pipelining section
112 applies software pipelining to the dependency graph 105
to obtain the software-pipelined instruction schedule 103.

As explained above, the dependency graph has an edge between
each prefetch instruction and the corresponding memory
reference instruction indicating that the necessary delay
is longer than the time required for transfer of the cache
block from the main memory to the cache. When the software
pipelining is applied, this edge guarantees that the
interval between the two instructions is at least that
transfer time, so the prefetch instruction can be scheduled
to hide the latency due to the memory reference.
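
As an illustration only, the role of this edge can be
sketched in Python. The class DepGraph, the functions
add_prefetch_edges and satisfies, and the dictionary
representation are assumptions made for this sketch and do
not appear in the description. The 50-cycle figure is taken
from the worked example later in the description, and the
sketch treats a separation equal to the transfer time as
sufficient.

```python
from dataclasses import dataclass, field


@dataclass
class DepGraph:
    # edges[(src, dst)] = minimum number of cycles from src to dst
    edges: dict = field(default_factory=dict)

    def add_edge(self, src, dst, delay):
        self.edges[(src, dst)] = delay


def add_prefetch_edges(graph, pairs, transfer_cycles):
    """Add one delay edge per (prefetch, memory reference) pair.

    transfer_cycles is the assumed time to move a cache block
    from main memory to the cache.
    """
    for prefetch, mem_ref in pairs:
        graph.add_edge(prefetch, mem_ref, transfer_cycles)
    return graph


def satisfies(graph, issue_cycle):
    """Check that a candidate schedule respects every delay edge.

    issue_cycle maps each instruction to the cycle it is issued in.
    """
    return all(issue_cycle[dst] - issue_cycle[src] >= delay
               for (src, dst), delay in graph.edges.items())


graph = add_prefetch_edges(DepGraph(), [("FETCH X[i]", "LOAD X[i]")], 50)
print(satisfies(graph, {"FETCH X[i]": 0, "LOAD X[i]": 50}))
```

A scheduler that honors every edge in such a graph cannot
place the load closer to its prefetch than the transfer
time, which is the property the paragraph above relies on.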

The prefetch instruction generator 109 explained above will
be further explained with reference to the operation
flowchart shown in FIG. 7. First, in step 701, it is judged
whether any memory reference instructions to be processed
remain. When such an instruction remains, control proceeds
to step 702. When there is no such instruction, processing
is completed. In step 702, the memory reference instruction
to be processed is selected and stored in a variable MI. In
step 703, it is judged whether the memory reference
instruction stored in MI has a high possibility of
generating a cache miss. When the possibility is high,
control proceeds to step 704. When the possibility is low,
control returns to step 701 to process the next memory
reference instruction. In step 704, a prefetch instruction
referring to the same address as the memory reference
instruction stored in MI is generated.
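
The loop of FIG. 7 might be sketched in Python as follows.
This is a minimal sketch, not the patented implementation:
the tuple representation of instructions and the predicate
likely_to_miss are hypothetical, the latter standing in for
the cache-miss estimate that the description leaves to the
prior art.

```python
def generate_prefetches(memory_refs, likely_to_miss):
    """Return one prefetch per memory reference judged likely to miss.

    memory_refs    -- list of (opcode, address) tuples,
                      e.g. ("LOAD", "X[i]") (assumed representation)
    likely_to_miss -- hypothetical predicate for the miss estimate
    """
    prefetches = []
    for mi in memory_refs:                 # steps 701-702: next MI
        if likely_to_miss(mi):             # step 703: miss likely?
            _, address = mi
            prefetches.append(("FETCH", address))  # step 704: same address
    return prefetches


refs = [("LOAD", "X[i]"), ("LOAD", "Y[i]"), ("STORE", "X[i]")]
print(generate_prefetches(refs, lambda mi: mi[0] == "LOAD"))
```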

Next, an embodiment for scheduling the loop by method 2
will be explained. In method 2, the following processings
are executed in addition to the processings of method 1.
First, the kernel section of the software-pipelined
instruction schedule 106 obtained by the processing of
method 1 is unrolled a plurality of times in the loop
unrolling section 113 to obtain the instruction schedule
107. The number of times of unrolling is set, for example,
to the least common multiple of B/D and N, where B is the
size of a cache block that can be moved from the main memory
to the cache by execution of one prefetch instruction, D is
the size of the element referred to by the memory reference
instruction and N is the increment of the array reference
element.
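
The unrolling count described above can be computed as in
the following Python sketch. The function name unroll_count
and the sample sizes (a 32-byte block and 8-byte elements)
are assumptions chosen for illustration, and the sketch
assumes that B is an integer multiple of D.

```python
from math import gcd


def unroll_count(block_size, element_size, stride):
    """Least common multiple of B/D and N.

    block_size   -- B, bytes moved by one prefetch (assumed value)
    element_size -- D, bytes per referenced element (assumed value)
    stride       -- N, increment of the array reference element
    """
    per_block = block_size // element_size       # B/D
    return per_block * stride // gcd(per_block, stride)


print(unroll_count(32, 8, 1))
print(unroll_count(32, 8, 6))
```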

After the loop is unrolled, the prefetch instruction
deleting section 114 deletes the redundant prefetch
instructions from the instruction schedule 107 obtained by
the unrolling. Thereby, the final instruction schedule 103
not including the redundant prefetch instructions can be
obtained. Since it is enough to issue a prefetch instruction
once in every B/D iterations, the other copies of each
unrolled prefetch instruction are deleted so that it is
issued once in every B/D iterations.

With the method explained above, the number of times of loop
unrolling becomes large in some cases. Therefore, when the
number of times of unrolling must be kept low, the loop is
unrolled, for example, an adequate number of times and the
other instructions are deleted so that the prefetch
instruction is issued about once in every B/D iterations.
Thereby, a few redundant prefetch instructions may be
issued, but an increase in the number of times of unrolling
can be prevented.

Operations of the prefetch instruction deleting section 114
will now be explained with reference to the flowchart shown
in FIG. 8. First, in step 801, the size of the cache block
is set to a constant B and the size of the reference object
element to a constant D. In step 802, it is judged whether
unprocessed prefetch instructions remain. When such
instructions exist, control proceeds to step 803, and when
there are none, the processing is completed. In step 803,
the copies of the same unprocessed prefetch instruction made
by the loop unrolling section 113 in FIG. 1 are sequentially
assigned to variables PFi (0≤i≤n). In step 804a, if dividing
i by (B/D) leaves a remainder ((i mod (B/D))≠0) for 0≤i≤n,
namely if i is not an integer multiple of B/D, the prefetch
instruction PFi is deleted in step 804b, and control returns
to step 802 to process the next prefetch instruction.
Thereby, the prefetch instruction is issued once in every
B/D iterations. The "MOD" function is recognized in the
PASCAL computer programming language as the function that
computes the remainder of division.
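
The deletion rule of steps 804a and 804b can be illustrated
with a short Python sketch. The function name
delete_redundant and the list representation of the unrolled
copies are assumptions made for this sketch; Python's %
operator plays the role of MOD.

```python
def delete_redundant(copies, block_size, element_size):
    """Keep copy PFi only when i is an integer multiple of B/D.

    copies is the list of unrolled copies PF0, PF1, ... of one
    prefetch instruction (assumed representation).
    """
    per_block = block_size // element_size        # B/D
    return [pf for i, pf in enumerate(copies)
            if i % per_block == 0]                # step 804a test


copies = ["PF%d" % i for i in range(8)]
print(delete_redundant(copies, 32, 8))
```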

Next, an embodiment of scheduling the loop by method 3 will
be explained. First, as in the case of method 1, the
prefetch instruction generator 109 takes the intermediate
language 102 as an input and generates a prefetch
instruction for each memory reference instruction having a
high possibility of generating a cache miss, to obtain the
intermediate language 104 having the added prefetch
instructions.

Next, the prefetch instructions generated by the prefetch
instruction generator 109 are formed into groups, and each
group is replaced with a virtual prefetch instruction in
the prefetch instruction replacing section 110. In this
replacement, for example, the number of virtual prefetch
instructions generated is the minimum integer that is larger
than M/(B/D), where B is the size of a cache block that can
be moved from the main memory to the cache by execution of
one prefetch instruction, D is the size of the element
referred to by the memory reference instruction and M is the
number of prefetch instructions included in the intermediate
language 104. Each group of B/D prefetch instructions
corresponds to one virtual prefetch instruction. When a
virtual prefetch instruction is generated, the original
prefetch instructions in the intermediate language 104 are
deleted and the newly generated virtual prefetch instruction
is added.

Next, the dependency graph generator 111 generates a
dependency graph 105 from an input of the intermediate
language 104. In this case, unlike methods 1 and 2, there is
no dependent relationship between the virtual prefetch
instruction and the memory reference instruction.
Subsequently, the software-pipelined instruction schedule
106 is obtained by applying the software pipelining to the
loop in the software pipelining section 112. Since, unlike
methods 1 and 2, no dependent relationship is provided in
method 3 between the prefetch instruction and the
corresponding memory reference instruction, a high degree of
freedom in the arrangement of instructions is assured when
the software pipelining is applied.

Next, the software-pipelined instruction schedule 106 is
unrolled several times in the loop unrolling section 113 to
obtain an instruction schedule 107. As in the case of method
2, the number of times of unrolling is set, for example, to
the least common multiple of B/D and N, where B is the size
of the cache block that can be moved from the main memory to
the cache by execution of one prefetch instruction, D is the
size of the element referred to by the memory reference
instruction and N is the increment of the array reference
element. When the loop unrolling processing by the loop
unrolling section 113 is completed, each virtual prefetch
instruction included in the obtained instruction schedule
107 is recovered, in the prefetch instruction recovery
section 115, to the corresponding prefetch instruction that
was replaced in the prefetch instruction replacing section
110. Assume that n prefetch instructions PF1, PF2, . . . ,
PFn correspond to a certain virtual prefetch instruction VP,
and that VP is unrolled into m virtual prefetch
instructions VP1, VP2, . . . , VPm by the loop unrolling
section 113. The recovery processing is then performed, for
example, as explained hereunder.

In the case where n=B/D, when j=i mod(B/D), VPi is
replaced with PFj.

In the case where n<B/D, when j=i mod(B/D), VPi is
replaced with PFj if 1≤j≤n, and VPi is deleted if n<j.

As a result, an instruction schedule 108 consisting of the 
original prefetch instruction can be obtained. 

Next, an instruction schedule 103 not including redundant
prefetch instructions can be obtained by adjusting, in the
prefetch address adjusting section 116, the reference object
address of each prefetch instruction of the instruction
schedule 108 so that the data is prefetched for an iteration
that occurs sufficiently later for the data transfer by the
prefetch instruction to complete.

This address adjustment is performed as explained hereunder
when, for example, the prefetch instruction FETCH X[i] is
issued for the array X.

That is, let L be the number of cycles required for a single
execution of the scheduled loop and M the number of cycles
required for the prefetch instruction to transfer a cache
block of the object data from the main memory to the cache.
It is enough to prefetch the array element that is referred
to in the iteration that follows the current one by a number
of iterations equal to the minimum integer that is equal to
or larger than M/L+(B/D).

That is, when this number of iterations is defined as a, it
is enough to adjust the reference address of the above
prefetch instruction to FETCH X[i+a].
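
Written as a formula, with M, L, B and D as defined for this
adjustment, the rule stated in the preceding paragraphs is
as follows. The ceiling notation is an editorial rendering
of "the minimum integer which is equal to or larger than".

```latex
a = \left\lceil \frac{M}{L} + \frac{B}{D} \right\rceil,
\qquad
\mathrm{FETCH}\ X[i] \;\longrightarrow\; \mathrm{FETCH}\ X[i+a]
```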

Hereafter, the processings executed by the prefetch
instruction replacing section 110 and the prefetch
instruction recovery section 115 in method 3 will be
explained with reference to the flowcharts.

FIG. 9 is an operation flowchart of the prefetch instruction
replacing section 110 in FIG. 1. First, in step 901, the
size of a cache block is set to a constant B, the size of
the reference object element to a constant D and the value
of the variable n, which records the number of prefetch
instructions, to 0. In step 902, it is determined whether
any prefetch instructions still remain. When such
instructions remain, control proceeds to step 903. When
there is no such instruction, the processing is completed.
In step 903, it is judged whether the value of the variable
n is 0. When the result is YES, control proceeds to step
904. If the result is NO, control proceeds to step 906. In
step 904, a new virtual prefetch instruction is generated
and stored in the variable VPF. In step 905, the virtual
prefetch instruction stored in the variable VPF is inserted
into the intermediate language stream.

In step 906, a prefetch instruction is selected and stored
in the variable PF. In step 907, the prefetch instruction
recorded in the variable PF is made to correspond to the
virtual prefetch instruction recorded in the variable VPF.
In step 908, the prefetch instruction recorded in the
variable PF is deleted from the intermediate language
stream. In step 909, the variable n is incremented by 1
(one). In step 910, it is judged whether the value of n is
equal to (B/D). When the result is YES, control proceeds to
step 911. When the result is NO, control returns to step 902
to process the next prefetch instruction. In step 911,
control returns to step 902 to process the next prefetch
instruction. Thereby, every (B/D) prefetch instructions are
replaced with one virtual prefetch instruction.
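
The grouping performed in FIG. 9 might look like the
following Python sketch. The function name group_prefetches
and the list-of-lists result are assumptions. The text does
not state what step 911 does to the counter n, so the reset
of n to 0 in the sketch is an assumption made so that a new
virtual prefetch instruction is started after every B/D
prefetch instructions.

```python
def group_prefetches(prefetches, block_size, element_size):
    """Group prefetches so each group maps to one virtual prefetch.

    Returns a list of groups; each inner list holds the original
    prefetch instructions represented by one virtual prefetch.
    """
    per_block = block_size // element_size   # step 901: B/D
    groups = []
    n = 0                                    # step 901: n = 0
    for pf in prefetches:                    # steps 902 and 906
        if n == 0:                           # step 903
            groups.append([])                # steps 904-905: new VPF
        groups[-1].append(pf)                # steps 907-908
        n += 1                               # step 909
        if n == per_block:                   # step 910
            n = 0                            # step 911 (assumed reset)
    return groups


print(group_prefetches(["FETCH X[i]", "FETCH Y[i]"], 32, 8))
```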

FIG. 10 is an operation flowchart of the prefetch
instruction recovery section 115 in FIG. 1. First, in step
1001, the size of the cache block is set to a constant B and
the size of the reference object element to a constant D. In
step 1002, it is determined whether any virtual prefetch
instructions still remain. When such instructions remain,
control proceeds to step 1003. When there is no such
instruction, the processing is completed. In step 1003, the
copies of the same virtual prefetch instruction made by the
loop unrolling section 113 in FIG. 1 are sequentially stored
in the variables VPi (0≤i<m). In step 1004, the original
prefetch instructions corresponding to VPi are stored in the
variables PFj (0≤j<n).

In step 1005, it is judged whether the number n of prefetch
instructions PFj is (B/D). When the result is YES, control
proceeds to step 1006, and when the result is NO, control
proceeds to step 1007. In step 1006, with j=i MOD(B/D) for
each VPi, VPi is replaced with PFj, and control returns to
step 1002 to process the next virtual prefetch instruction.
In step 1007, with j=i MOD(B/D) for each VPi, VPi is
replaced with PFj if 0≤j<n, while VPi is deleted if n≤j,
and control returns to step 1002 to process the next virtual
prefetch instruction. Thereby, the virtual prefetch
instructions are recovered as the original prefetch
instructions, and each prefetch instruction appears once in
every B/D iterations.
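
Steps 1005 to 1007 can be sketched in Python as below. The
function name recover and the use of None for a deleted
(idle) slot are assumptions. The single test j < n covers
both branches, since it always holds when n equals B/D. The
order of the returned slots follows the index i and is not
meant to reproduce the slot numbering of any figure.

```python
def recover(num_copies, originals, block_size, element_size):
    """Replace unrolled copies VP0..VP(m-1) with original prefetches.

    num_copies -- m, number of unrolled copies of the virtual prefetch
    originals  -- PF0..PF(n-1), the prefetches the virtual one stands for
    """
    per_block = block_size // element_size    # step 1001: B/D
    slots = []
    for i in range(num_copies):               # each VPi
        j = i % per_block                     # j = i MOD (B/D)
        if j < len(originals):                # steps 1006/1007: 0 <= j < n
            slots.append(originals[j])
        else:                                 # step 1007: n <= j, delete
            slots.append(None)
    return slots


print(recover(4, ["FETCH X", "FETCH Y"], 32, 8))
```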

FIG. 11 is an operation flowchart of the prefetch address
adjusting section 116 in FIG. 1. First, in step 1101, the
size of the cache block is set to a constant B, the size of
the reference object element to a constant D, the number of
execution cycles per loop to L, the number of cycles
required for data transfer from the main memory to the cache
to M, and the number of iterations a by which the prefetch
instruction is issued in advance to the minimum integer
equal to or larger than M/L+(B/D). In step 1102, it is
determined whether any unprocessed prefetch instructions
still remain. When such instructions remain, control
proceeds to step 1103. When none remain, the processing is
completed. In step 1103, an unprocessed prefetch instruction
is selected and stored in the variable PF. In step 1104, the
address referred to by the prefetch instruction stored in
the variable PF is changed to the address referred to a
iterations later. Thereby, the prefetch instruction is
issued sufficiently before the memory reference instruction
and the waiting time due to the memory reference can be
hidden.
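
A minimal Python sketch of FIG. 11 follows. The function
names, the (array, offset) representation of a prefetch
address, and the sample values (60 transfer cycles, 5 cycles
per loop, a 32-byte block, 8-byte elements) are hypothetical
and chosen only to make the arithmetic easy to follow; they
are not the values of the worked example in the description.

```python
from math import ceil


def prefetch_distance(transfer_cycles, loop_cycles,
                      block_size, element_size):
    """Smallest integer >= M/L + B/D (step 1101)."""
    return ceil(transfer_cycles / loop_cycles
                + block_size / element_size)


def adjust_addresses(prefetches, a):
    """Shift each prefetch by a iterations (steps 1102-1104).

    A prefetch is an (array, offset) pair meaning array[i + offset].
    """
    return [(array, offset + a) for array, offset in prefetches]


a = prefetch_distance(60, 5, 32, 8)   # hypothetical M, L, B, D
print(a, adjust_addresses([("X", 0), ("Y", 0)], a))
```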

Subsequently, the effect of the scheduling by an embodiment
of each method will be explained using practical examples.
FIG. 12 is an example of a loop of a FORTRAN program used to
explain the embodiments. The intermediate language shown in
FIG. 13 can be constituted from the loop body of this
program. An example of the schedule of the prefetch
instructions in each method when this intermediate language
is used as an input is described hereunder.

In the example of FIG. 13, memory reference is performed
with the instructions 1301, 1302 and 1303, but since the
same address is referred to by the instructions 1301 and
1303, one prefetch instruction is generated for each of the
arrays X and Y. In this example, a super-scalar type
processor is assumed, which executes the memory reference
instruction, prefetch instruction and arithmetic instruction
in parallel. However, the present invention can be applied
not only to the super-scalar type processor but also to the
sequential type processor and the very long instruction word
(VLIW) processor. In the following examples, it is assumed
that the data used by four iterations can be transferred to
the cache with a single prefetch instruction and that data
transfer from the main memory to the cache requires 50
cycles.
Method 1 

(1) Generation of the prefetch instructions 

The prefetch instructions are generated for the arrays X
and Y. The intermediate language having the added prefetch
instructions is shown in FIG. 14. In this figure, the
instructions 1401 and 1402 are the prefetch instructions for
the arrays X and Y, respectively.

(2) Generation of dependency graph 

FIG. 15 illustrates a dependency graph for the intermediate
language having the added prefetch instructions. In this
figure, a node indicates an instruction and an arrow between
nodes indicates a dependent relationship. The numeral at the
right side of each arrow indicates the number of cycles by
which the instructions must be separated. As shown in this
figure, a dependent relationship having a delay of 50
cycles, the time required for the transfer of data from the
main memory to the cache, is provided between the prefetch
instruction 1501 for the array X and the load instruction
1503 for the array X, and between the prefetch instruction
1502 for the array Y and the load instruction 1504 for the
array Y.

(3) Software pipelining 

The software pipelining is applied to the dependency graph
of FIG. 15. The software-pipelined schedule is shown in FIG.
16. The schedule shown in FIG. 16 is composed of a prologue
section 1601 for initializing the loop, a kernel section
1602 for repeating the loop and an epilogue section 1603 for
terminating the loop. Each entry of FIG. 16 indicates an
instruction slot. The prefetch instructions are assigned to
the instruction slots 1604 and 1605 and are scheduled by the
software pipelining to be executed 10 iterations before the
corresponding memory reference instructions. Since the
software pipelining yields a schedule that satisfies the
dependent relationships between instructions, the waiting
time generated by the memory reference can be eliminated.
Method 2 

In the embodiment of method 1, two prefetch instructions are
generated for a single iteration. Since the data used by
four iterations can be prefetched with one prefetch
instruction, it is useless to issue the prefetch instruction
in every iteration. Therefore, in method 2, the generation
of useless prefetch instructions is suppressed by applying
the following processing in addition to method 1.

(4) Loop development 

The kernel section of the software-pipelined loop
constituted in item (3) of method 1 is unrolled. Since it is
assumed, in this embodiment, that the data referred to by
four iterations can be transferred to the cache with a
single prefetch operation, it is enough to issue the
prefetch instruction once in every four iterations.
Therefore, the schedule indicated in FIG. 17 can be obtained
by unrolling the kernel section four times. The schedule
shown in FIG. 17 is composed of the prologue section 1701,
the unrolled kernel section 1702 and the epilogue section
1703. The prefetch instruction for the array X in the kernel
section 1702 is unrolled to the instruction slots 1704,
1706, 1708 and 1710, while the prefetch instruction for the
array Y is unrolled to the instruction slots 1705, 1707,
1709 and 1711.

(5) Deletion of redundant prefetch instructions 

The redundant prefetch instructions for the arrays X and Y
are deleted from the instruction schedule of FIG. 17
obtained by unrolling the loop, so that one prefetch
instruction is issued for every four iterations. Thereby,
generation of useless prefetch instructions is suppressed
and the schedule shown in FIG. 18 can be obtained. In FIG.
18, the redundant prefetch instructions 1805, 1806, 1807,
1808, 1810 and 1811 are deleted from the kernel section
1802, and the data for four iterations can be effectively
prefetched by the respective prefetch instructions of the
instruction slot 1804 for the array X and the instruction
slot 1809 for the array Y.
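
The counts in this example (eight unrolled prefetch slots,
of which six are deleted and two kept) can be reproduced
with a short Python sketch. The function name
unroll_and_prune and the (instruction, copy index)
representation are assumptions made for illustration; the
sketch keeps the copy belonging to the first of every four
iterations.

```python
def unroll_and_prune(kernel_prefetches, iterations_per_block):
    """Unroll the kernel's prefetches, then keep one copy per block.

    iterations_per_block is the number of iterations whose data
    one prefetch covers (four in this example).
    """
    unrolled = [(pf, copy)
                for copy in range(iterations_per_block)
                for pf in kernel_prefetches]
    kept = [(pf, copy) for pf, copy in unrolled
            if copy % iterations_per_block == 0]
    return unrolled, kept


unrolled, kept = unroll_and_prune(["FETCH X", "FETCH Y"], 4)
print(len(unrolled), len(unrolled) - len(kept), kept)
```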
Method 3 

In method 3, the prefetch instructions are scheduled as
explained hereunder so that useless prefetch instructions
are never issued.

(1) Generation of prefetch instructions 

The generation of the prefetch instructions is executed in 
the same manner as for method 1. 

(2) Replacement of prefetch instruction and generation of 
dependency graph 



A plurality of prefetch instructions generated in item (1)
explained above are grouped to form virtual prefetch
instructions in order to constitute a dependency graph. The
obtained dependency graph is shown in FIG. 19. As shown in
FIG. 19, the prefetch instruction 1901 for the array X and
the prefetch instruction 1902 for the array Y are combined
and replaced with the virtual prefetch instruction 1903.
Unlike methods 1 and 2, no dependent relationship is
provided, in method 3, between the virtual prefetch
instruction and the corresponding memory reference
instruction.

(3) Software pipelining

The software pipelining is applied to the loop body to
which the virtual prefetch instruction is added. As a
result, the software-pipelined schedule can be obtained as
shown in FIG. 20. The schedule shown in FIG. 20 is composed
of the prologue section 2001, kernel section 2002 and
epilogue section 2003. The virtual prefetch instruction is
scheduled in the instruction slot 2004 of the kernel section
2002.

(4) Loop development 

As in the case of method 2, the kernel section of the
software-pipelined schedule constituted in item (3)
explained above is unrolled four times. Thereby, the
schedule shown in FIG. 21 can be obtained. The schedule
shown in FIG. 21 is composed of the prologue section 2101,
the unrolled kernel section 2102 and the epilogue section
2103. The virtual prefetch instruction scheduled in the
kernel section 2102 is copied by the loop unrolling section
into the instruction slots 2104, 2105, 2106 and 2107.

(5) Recovery of prefetch address 

The virtual prefetch instruction unrolled into the
instruction slots 2104, 2105, 2106 and 2107 of the kernel
section 2102 of FIG. 21 is replaced with the original
prefetch instructions. The result is shown in FIG. 22. Since
the virtual prefetch instruction unrolled into the
instruction slots 2104, 2105, 2106 and 2107 of FIG. 21 was
obtained by replacing the prefetch instructions for the
arrays X and Y, the prefetch instructions are inserted into
the instruction slots 2204 and 2206 of FIG. 22 so that the
prefetch instruction for each array is issued once in every
four iterations. In this case, since the number of original
prefetch instructions is less than the number of iterations
that refer to the data transferable from the main memory to
the cache with a single prefetch operation, the instruction
slots 2205 and 2207 for the unrolled virtual prefetch
instruction are left as idle slots.

(6) Adjustment of prefetch address 

The address as an object of the prefetch is adjusted so that
an interval as great as the number of cycles sufficient for
completing the transfer of a cache block from the main
memory to the cache is maintained between issuance of the
prefetch instruction and issuance of the instruction that
refers to the data transferred to the cache by the prefetch
instruction. Since the transfer of the cache block from the
main memory to the cache requires 50 cycles and a single
iteration requires four cycles, the reference destination of
each prefetch instruction is changed here so that the data
referred to 14 iterations later is prefetched, as shown in
FIG. 22.

As explained above, the instruction schedule 103 including
the prefetch instruction can be generated by the scheduler
101 using the intermediate language of FIG. 1 as an input.
That is, when the reference to data is not continuously
performed in the iterations of the loop, method 1 allows the
memory reference instruction corresponding to the prefetch
instruction to be issued after an interval as long as the
number of cycles required for the transfer of data from the
main memory to the cache, so the waiting time due to the
memory reference can be hidden. Moreover, when the reference
to data is continuously performed, issuance of the redundant
prefetch instructions can be suppressed by utilizing methods
2 and 3. In addition, since, in comparison with method 2, no
dependent relationship is provided in method 3 between the
virtual prefetch instruction and the memory reference
instruction, the degree of freedom in the arrangement of
instructions increases, and since the software pipelining is
applied considering the unrolling of the kernel section,
the delay time due to the dependent relationships between
the instructions can be kept to a minimum.

Although the embodiments of the invention have been 
described in relation to schedulers, sections, generators, 
unrolling sections and the like, it is understood that these 
components of the invention are embodied by software 
stored in computer memory or on memory storage media for 
execution by a computer and that the software is executed to 
enable the methods to be performed by a computer, such as 
a general purpose computer. 

According to the present invention, the waiting time due 
to memory reference, etc. during execution of the programs 
can be reduced by effectively scheduling the prefetch 
instruction. Thereby, the present invention is very effective 
for high speed execution of the computer programs. 

Namely, according to the present invention, if reference to
the memory is not performed continuously, the software
pipelining can be applied by method 1 while keeping
sufficiently long intervals between the prefetch
instructions and the memory reference instructions.
Moreover, when reference to the memory is performed
continuously, the redundant instructions are deleted after
application of the software pipelining by method 2, or a
plurality of prefetch instructions are replaced with virtual
prefetch instructions by method 3 for application of the
software pipelining and the virtual prefetch instructions
are thereafter recovered to the original prefetch
instructions. In either case, the issuance of useless
prefetch instructions is suppressed and an effective
schedule is realized.

We claim: 

1. A data prefetch method in a compiler for compiling 
programs to be executed on a computer having a prefetch 
instruction for transferring data to a cache memory from a 
main memory in parallel with execution of other 
instructions, comprising: 

(a) converting a source program in a loop of a program 
into intermediate code; 

(b) replacing a plurality of prefetch instructions included 
in a loop of a program into one virtual prefetch instruc- 
tion independent of memory reference; 

(c) generating a dependency graph having edges and 
showing a required delay between said virtual prefetch 
instruction and an instruction for memory reference in 
accordance with said intermediate code that is longer 
than a time required to transfer the data of the virtual 
prefetch instruction to the cache memory from the main 
memory; 

(d) executing instruction scheduling by applying software 
pipelining, for scheduling instructions to hide latency 
between instructions by execution through overlap of 
different iterations of the loop, to said dependency 
graph; and 

(e) unrolling the obtained schedule for a plurality of times 
to replace said unrolled virtual prefetch instruction with 
a plurality of initial prefetch instructions. 

2. A data prefetch method according to claim 1, wherein
said step (e) further comprises adjusting the address which
is referred to by said replaced prefetch instruction to the
address which is referred to by the iteration which is
sufficiently later to complete the data transfer by said
prefetch instruction.

3. A data prefetch method in a compiler for compiling 
programs to be executed on a computer having a prefetch 
instruction for transferring data to a cache memory from a 
main memory in parallel with execution of the other 
instructions, comprising: 

(a) converting a source program in a loop of a program 
into intermediate code; 

(b) replacing a plurality of prefetch instructions included 
in a loop of a program into one virtual prefetch instruc- 
tion independent of memory reference; 

(c) generating a dependency graph having edges and 
showing a required delay between said virtual prefetch 
instruction and an instruction for memory reference in 
accordance with said intermediate code that is longer 
than a time required to transfer the data of the virtual 
prefetch instruction to the cache memory from the main 
memory; 

(d) executing instruction scheduling by applying a soft- 
ware pipelining, for scheduling instructions to hide 
latency between instructions by execution through 
overlap of different iterations of the loop, to said 
dependency graph; and 

(e) unrolling said instruction scheduling, 
wherein said step (e) further comprises: 

(e1) unrolling the obtained schedule for a plurality of 
times; and 

(e2) replacing said unrolled virtual prefetch instruction 
with a plurality of initial prefetch instructions. 

4. A data prefetch method according to claim 3, wherein 
said step (e) further comprises adjusting the address which 
is referred to by said replaced prefetch instruction to the 
address which is referred to by the iteration which is 
sufficiently later to complete the data transfer by said 
prefetch instruction. 

5. A compile program stored on a computer readable 
storage medium executing a data prefetch method on a 
computer having a prefetch instruction for transferring data 
to a cache memory from main memory in parallel with 
execution of the other instructions, comprising: 

(a) converting a source program in a loop of a program 
into intermediate code; 

(b) replacing a plurality of prefetch instructions included 
in a loop of a program into one virtual prefetch instruc- 
tion independent of memory reference; 

(c) generating a dependency graph having edges and 
showing a required delay between said virtual prefetch 
instruction and an instruction for memory reference in 
accordance with said intermediate code that is longer 
than a time required to transfer the data of the prefetch 
instruction to the cache memory from the main 
memory; 

(d) executing instruction scheduling by applying a soft- 
ware pipelining, for scheduling instructions to hide 
latency between instructions by execution through 
overlap of different iterations of the loop, to said 
dependency graph; and 

(e) unrolling said instruction scheduling, 
wherein said step (e) further comprises: 

(e1) unrolling the obtained schedule for a plurality of 
times; and 

(e2) replacing said unrolled virtual prefetch instruction 
with a plurality of initial prefetch instructions. 



01/29/2004, EAST Version: 1.4.1 




5,950,007 

6. A compile program according to claim 5, wherein said 
step (e) further comprises adjusting the address which is 
referred to by said replaced prefetch instruction to the 
address which is referred to by the iteration which is 
sufficiently later to complete the data transfer by said 
prefetch instruction. 

* * * * * 



United States Patent [19] 
Iitsuka 

US005862385A 

[11] Patent Number: 5,862,385 
[45] Date of Patent: Jan. 19, 1999 



[54] COMPILE METHOD FOR REDUCING 
CACHE CONFLICT 

[75] Inventor: Takayoshi Iitsuka, Sagamihara, Japan 

[73] Assignee: Hitachi, Ltd., Japan 

[21] Appl. No.: 861,187 

[22] Filed: May 21, 1997 

Related U.S. Application Data 

[63] Continuation of Ser. No. 303,007, Sep. 8, 1994. 
[30] Foreign Application Priority Data 

Sep. 10, 1993 [JP] Japan 5-250014 

[51] Int. Cl.6 G06F 9/45 

[52] U.S. Cl. 395/709; 395/707; 395/708 

[58] Field of Search 395/701, 703, 704, 705, 706, 707, 708, 709, 445, 460, 461 

[56] References Cited 

U.S. PATENT DOCUMENTS 

5,142,634 8/1992 Fite et al 395/375 

5,337,415 8/1994 DeLane et al 395/375 

5,442,790 8/1995 Nosenchuck 395/700 

5,481,751 1/1996 Alpert et al 395/800 

5,488,729 1/1996 Vegesna et al 395/800 

5,497,499 3/1996 Garg et al 395/800 

OTHER PUBLICATIONS 

Jouppi, N., "Improved Direct-Mapped Cache Performance 
by the Addition of a Small Fully-Associative Cache and 
Prefetch Buffers", Computer Architecture, 1990 International 
Symposium, IEEE, pp. 363-373, May 1990. 

Gosmann et al., "Code Reorganization for Instruction 
Caches", System Sciences, 1993 Annual Hawaii Int'l Conference, 
IEEE, pp. 214-223, Jan. 1993. 

Pyster, A., "Compiler Design and Construction", 1980, pp. 
11, Jan. 1990. 

Leonard, D., "The Moral Dilemma of Decompilers", Data 
Based Advisor, v10 n8 p. 38(3), Aug. 1992. 

Aho, Alfred V. et al., Compilers (©1985, Addison Wesley 
Publishing Co.), pp. 464-473. 

Hennessy, J. et al., Computer Architecture: A Quantitative 
Approach (©1990 Morgan Kaufmann Publishers, Inc., Palo 
Alto, California), pp. 408-425. 

Krishnamurthy, Sanjay M., "A Brief Survey of Papers on 
Scheduling for Pipelined Processors", ACM SIGPLAN 
Notices, vol. 25, No. 7, Jul. 1990, pp. 97-106. 

Primary Examiner: James P. Trammell 
Assistant Examiner: Kakali Chaki 

Attorney, Agent, or Firm: Michaelson & Wallace; Peter L. 
Michaelson; John C. Pokotylo 



[57] ABSTRACT 

A compiling method, for use with programs to be executed 
on a computer with cache memory, which programs would 
otherwise generate decreased performance due to cache 
conflicts arising from conflicting cache access(es), for 
reducing the generation of such cache conflict(s) by reordering 
the order of memory reference code in the program 
such that conflicting memory references do not start before 
the completion of memory accesses to a memory block. 

6 Claims, 30 Drawing Sheets 



[Representative drawing: flowchart with the boxes "Cache conflict analysis", "Cache reference group analysis", and "Cache conflict graph analysis" and the exit "No cache conflict"; the same boxes appear in FIG. 8.] 



U.S. Patent Jan. 19, 1999 Sheet 1 of 30 5,862,385 



FIG. 1 







FIG. 2 (PRIOR ART) 

Source Program 

      COMMON /BLK1/ A, B, C 
      REAL A(256,128), B(256,128,3), C(256,128) 
      DO 10 J = 1, 128 
      DO 20 I = 1, 255 
      C(I,J) = A(I,J) * B(I,J,2) + A(I+1,J) * B(I,J,3) + B(I,J,1) 
   20 CONTINUE 
   10 CONTINUE 

FIG. 3 (PRIOR ART) 

[Figure: overview of the mapping from main memory 120 to cache 110. The cache is drawn as two 128-Kbyte areas. A(1:256,1:128), B(1:256,1:128,2), and C(1:256,1:128) map to one area; B(1:256,1:128,1) and B(1:256,1:128,3) map to the other.] 



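FIG. 4 (PRIOR ART) 

[Figure: detailed mapping from main memory 120 to cache 110, cache blocks 111 to 113. Memory blocks shown: A(1:4,J), B(1:4,J,2), C(1:4,J); A(5:8,J), B(5:8,J,2), C(5:8,J); A(9:12,J), B(9:12,J,2), C(9:12,J). The three blocks covering the same element range of A, B(.,.,2), and C share one cache block.] 

FIG. 5 (PRIOR ART) 

[Figure: detailed mapping from main memory 120 to cache 110, continuing with cache blocks up to 116. Memory blocks shown: B(1:4,J,1), B(5:8,J,1), B(9:12,J,1) and B(1:4,J,3), B(5:8,J,3), B(9:12,J,3). The two blocks of array B covering the same element range share one cache block.] 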



FIG. 6 (PRIOR ART) 

Object Program 

Instruction  Label  Instruction  Operand         Processing contents 
number (61)  (62)   (63)         (64)            (65) 
 1           loop:  fload        A(r1,J),fr1     fr1 = A(I,J) 
 2                  fload        B(r1,J,2),fr2   fr2 = B(I,J,2) 
 3                  fmpy         fr1,fr2,fr2     fr2 = fr1 * fr2 
 4                  fload        A(r1+1,J),fr3   fr3 = A(I+1,J) 
 5                  fload        B(r1,J,3),fr4   fr4 = B(I,J,3) 
 6                  fmpy         fr3,fr4,fr4     fr4 = fr3 * fr4 
 7                  fadd         fr2,fr4,fr4     fr4 = fr2 + fr4 
 8                  fload        B(r1,J,1),fr5   fr5 = B(I,J,1) 
 9                  fadd         fr4,fr5,fr5     fr5 = fr4 + fr5 
10                  fstore       fr5,C(r1,J)     C(I,J) = fr5 
11                  addi         r1,#1,r1        r1 = r1 + 1 
12                  cmp          r1,#255         compare r1 and 255 
13                  ble          loop            if r1 <= 255 goto loop 



FIG. 7 (PRIOR ART) 

I      Instruction  Referenced     Hit     Accepted    Rejected    Cache miss caused 
value  number       array element  status  block       block       by cache conflict 
I = 1   1           A(1,J)         Miss    A(1:4,J) 
        2           B(1,J,2)       Miss    B(1:4,J,2)  A(1:4,J) 
        4           A(2,J)         Miss    A(1:4,J)    B(1:4,J,2)  YES 
        5           B(1,J,3)       Miss    B(1:4,J,3) 
        8           B(1,J,1)       Miss    B(1:4,J,1)  B(1:4,J,3) 
       10           C(1,J)         Miss    C(1:4,J)    A(1:4,J) 
I = 2   1           A(2,J)         Miss    A(1:4,J)    C(1:4,J)    YES 
        2           B(2,J,2)       Miss    B(1:4,J,2)  A(1:4,J)    YES 
        4           A(3,J)         Miss    A(1:4,J)    B(1:4,J,2)  YES 
        5           B(2,J,3)       Miss    B(1:4,J,3)  B(1:4,J,1)  YES 
        8           B(2,J,1)       Miss    B(1:4,J,1)  B(1:4,J,3)  YES 
       10           C(2,J)         Miss    C(1:4,J)    A(1:4,J)    YES 
I = 3   1           A(3,J)         Miss    A(1:4,J)    C(1:4,J)    YES 
        2           B(3,J,2)       Miss    B(1:4,J,2)  A(1:4,J)    YES 
        4           A(4,J)         Miss    A(1:4,J)    B(1:4,J,2)  YES 
        5           B(3,J,3)       Miss    B(1:4,J,3)  B(1:4,J,1)  YES 
        8           B(3,J,1)       Miss    B(1:4,J,1)  B(1:4,J,3)  YES 
       10           C(3,J)         Miss    C(1:4,J)    A(1:4,J)    YES 
I = 4   1           A(4,J)         Miss    A(1:4,J)    C(1:4,J)    YES 
        2           B(4,J,2)       Miss    B(1:4,J,2)  A(1:4,J)    YES 
        4           A(5,J)         Miss    A(5:8,J) 
        5           B(4,J,3)       Miss    B(1:4,J,3)  B(1:4,J,1)  YES 
        8           B(4,J,1)       Miss    B(1:4,J,1)  B(1:4,J,3)  YES 
       10           C(4,J)         Miss    C(1:4,J)    A(1:4,J)    YES 



FIG. 8 

[Flowchart: cache conflict analysis 30 performs cache reference group analysis 32 and then cache conflict graph analysis 34. It produces cache conflict information, made up of cache reference group information and cache conflict graph 9500, and exits on one of two branches: "Cache conflict exists" or "No cache conflict".] 



FIG. 9A and FIG. 9B (together FIG. 9) 

[Flowchart: cache reference group analysis 32, steps 3202 to 3236. Legible box text: 
    n = 0 
    Repeat statement Si in the intermediate code from start to end 
    n = n+1; m = n 
    Create number m cache reference group, i.e. CAG(m); register statement Si to CAG(m) 
    Repeat statement Sj in the intermediate code from after Si to end 
    Does Sj reference the memory? 
    Calculate distance d between the Sj memory reference and CAG(m). 
    Register Sj to CAG(m) 
    Re-register all memory references currently registered in CAG(m) to the cache reference group of Sj. 
    n = n-1 
    Set the reference group number of Sj to m. 
The remaining decision conditions and the assignment of step numerals to boxes are not recoverable from the extraction.] 



FIG. 10 

[Flowchart: detail of step 3218 (distance d between the memory reference of Sj and CAG(m)), steps 3252 to 3270. Legible box text: 
    Obtain the address calculation expression AEj of the memory referenced by Sj 
    d = Dmax + 1 
    Repeat for each memory reference Ai registered in CAG(m) 
    Extract the address calculation expression AEi from memory reference Ai 
    diff = AEi - AEj 
    d = abs(diff) 
    d = diff 
The decision conditions are not recoverable from the extraction.] 



FIG. 11 

[Flowchart: cache conflict graph analysis 34, steps 3402 to 3420. Legible box text: 
    Node generation in cache conflict graph 
    Repeat i = 1 to 'the number of cache reference groups' 
    Repeat j = i+1 to 'the number of cache reference groups' 
    Obtain distance dc between cache reference groups CAG(i) and CAG(j) 
    Place an edge between nodes i and j in the cache conflict graph. 
The decision condition that precedes the edge placement is not recoverable from the extraction.] 



FIG. 12 

[Flowchart: detail of step 3408 (distance dc between CAG(i) and CAG(j)), beginning with steps 3452 to 3456. Legible box text: 
    dc = DCmax + 1 
    Repeat for each memory reference Ai registered in CAG(i) 
    Obtain distance diffc between Ai and CAG(j). 
The remaining boxes are not recoverable from the extraction.] 



FIG. 13 

[Flowchart: reordering of memory references, with the cache conflict information as input. Boxes in order: 
    Write cache reference group numbers to intermediate code. 
    Determine loop unrolling method. (52) 
    Loop unrolling (53) 
    Variable name renaming 
    Common expression elimination of array elements (55) 
    Cache conflict reduction through reordering of intermediate code] 



FIG. 14 

[Flowchart: loop unrolling method determination 52, steps 521 to 538, reading the cache conflict information and writing the loop unrolling candidate table. Legible box text: 
    Loop unrolling candidate table generation: repeat for each cache reference group CAG(i) 
    Assign to variable AE the address calculation expression of one memory reference in CAG(i) 
    Of the loops for which loop unrolling is possible, when the loop control variable is ii and the increment is inc, the loop for which abs(AE[ii+inc] - AE[ii]) is minimum is the candidate loop for unrolling. 
    Select a maximum of all such unrolling counts as the candidate loop unrolling count for the candidate loop for unrolling so that, after loop unrolling, all the memory references in CAG(i) are included in one memory block 
    Write the cache reference group number, unrolling loop candidate and loop unrolling count candidate for this candidate loop to the loop unrolling candidate table. 
    Select the particular loop to be unrolled as the loop which appears, among the loop candidates 84, the most often in the loop unrolling candidate table. 
    Determine the number of times the particular loop is to be unrolled as the largest number among the loop unrolling candidates count entries 86 for this loop.] 



FIG. 15 

[Flowchart: cache conflict reduction through reordering of intermediate code, steps 570 to 598. Legible box text: 
    Allocation of statements in intermediate code to schedule table 
    Data dependency graph generation 
    Priority determination by critical path method 
    Time slot t = 1 
    Of the statements not yet allocated to the schedule table, detect and collect in SR those which can be executed at time slot t. 
    SRorig = SR 
    Remove from SR memory reference intermediate code belonging to cache reference groups which conflict with cache reference groups for which only a portion of memory references is allocated to the schedule table 
    SR = SRorig 
    Allocate the one highest priority statement in SR to time slot t in the schedule table. 
    t = t + 1 
    Reorder the intermediate code according to the statement order of the schedule table. 
The decision conditions are not recoverable from the extraction.] 



FIG. 16 

Intermediate Code 5 

Statement     Label  Three address          Reference group 
number (305)  (310)  statement (320)        number 
 1            loop:  T1 = A(I,J) 
 2                   T2 = B(I,J,2) 
 3                   T3 = T1 * T2 
 4                   T4 = A(I+1,J) 
 5                   T5 = B(I,J,3) 
 6                   T6 = T4 * T5 
 7                   T7 = T3 + T6 
 8                   T8 = B(I,J,1) 
 9                   T9 = T7 + T8 
10                   C(I,J) = T9 
11                   I = I + 1 
12                   if I <= 255 goto loop 

FIG. 17 

[Figure: structure of the cache conflict information, which consists of cache reference group information and cache conflict graph 9500. Each entry of the cache reference group information holds a statement number, the referenced array element, and its address calculation expression.] 



FIG. 18 

[Figure: cache conflict information for the loop of FIG. 16, comprising cache conflict graph 9500 (drawing not recoverable from the extraction) and the cache reference group information tabulated below.] 

Cache reference group information 

Group  Statement  Referenced     Address calculation expression 
       number     array element 
1       1         A(I,J)         Starting address of A + 4*(I-1 + 256*(J-1)) 
1       4         A(I+1,J)       Starting address of A + 4*(I+1-1 + 256*(J-1)) 
2       2         B(I,J,2)       Starting address of B + 4*(I-1 + 256*(J-1) + 256*128*(2-1)) 
3       5         B(I,J,3)       Starting address of B + 4*(I-1 + 256*(J-1) + 256*128*(3-1)) 
4       8         B(I,J,1)       Starting address of B + 4*(I-1 + 256*(J-1) + 256*128*(1-1)) 
5      10         C(I,J)         Starting address of C + 4*(I-1 + 256*(J-1)) 



FIG. 19 

Intermediate Code 5 

Statement     Label  Three address          Reference group 
number (305)  (310)  statement (320)        number (330) 
 1            loop:  T1 = A(I,J)            1 
 2                   T2 = B(I,J,2)          2 
 3                   T3 = T1 * T2 
 4                   T4 = A(I+1,J)          1 
 5                   T5 = B(I,J,3)          3 
 6                   T6 = T4 * T5 
 7                   T7 = T3 + T6 
 8                   T8 = B(I,J,1)          4 
 9                   T9 = T7 + T8 
10                   C(I,J) = T9            5 
11                   I = I + 1 
12                   if I <= 255 goto loop 

FIG. 22A 

Loop Unrolling Candidate Table 

[Table: empty table with the columns cache reference group number (82), unrolling loop candidate (84), and loop unrolling count candidate (86).] 

FIG. 22B 

Loop Unrolling Candidate Table 

Cache reference    Unrolling loop    Loop unrolling 
group number (82)  candidate (84)    count candidate (86) 
1                  DO 20 I=1,255     3 
2                  DO 20 I=1,255     4 
3                  DO 20 I=1,255     4 
4                  DO 20 I=1,255     4 
5                  DO 20 I=1,255     4 



FIG. 20A and FIG. 20B (together FIG. 20) 

Intermediate Code 

Statement     Label  Three address          Reference group 
number (305)  (310)  statement (320)        number (330) 
 1            loop:  T1 = A(I,J)            1 
 2                   T2 = B(I,J,2)          2 
 3                   T3 = T1 * T2 
 4                   T4 = A(I+1,J)          1 
 5                   T5 = B(I,J,3)          3 
 6                   T6 = T4 * T5 
 7                   T7 = T3 + T6 
 8                   T8 = B(I,J,1)          4 
 9                   T9 = T7 + T8 
10                   C(I,J) = T9            5 
11                   T11 = A(I+1,J)         1 
12                   T12 = B(I+1,J,2)       2 
13                   T13 = T11 * T12 
14                   T14 = A(I+2,J)         1 
15                   T15 = B(I+1,J,3)       3 
16                   T16 = T14 * T15 
17                   T17 = T13 + T16 
18                   T18 = B(I+1,J,1)       4 
19                   T19 = T17 + T18 
20                   C(I+1,J) = T19         5 
21                   T21 = A(I+2,J)         1 
22                   T22 = B(I+2,J,2)       2 
23                   T23 = T21 * T22 
24                   T24 = A(I+3,J)         1 
25                   T25 = B(I+2,J,3)       3 
26                   T26 = T24 * T25 
27                   T27 = T23 + T26 
28                   T28 = B(I+2,J,1)       4 
29                   T29 = T27 + T28 
30                   C(I+2,J) = T29         5 
31                   T31 = A(I+3,J)         1 
32                   T32 = B(I+3,J,2)       2 
33                   T33 = T31 * T32 
34                   T34 = A(I+4,J)         1 
35                   T35 = B(I+3,J,3)       3 
36                   T36 = T34 * T35 
37                   T37 = T33 + T36 
38                   T38 = B(I+3,J,1)       4 
39                   T39 = T37 + T38 
40                   C(I+3,J) = T39         5 
41                   I = I + 4 
42                   if I <= 252 goto loop 



FIG. 21A and FIG. 21B (together FIG. 21) 

Intermediate Code 

Statement     Label  Three address          Reference group 
number (305)  (310)  statement (320)        number (330) 
 1            loop:  T1 = A(I,J)            1 
 2                   T2 = B(I,J,2)          2 
 3                   T3 = T1 * T2 
 4                   T4 = A(I+1,J)          1 
 5                   T5 = B(I,J,3)          3 
 6                   T6 = T4 * T5 
 7                   T7 = T3 + T6 
 8                   T8 = B(I,J,1)          4 
 9                   T9 = T7 + T8 
10                   C(I,J) = T9            5 
11                   T12 = B(I+1,J,2)       2 
12                   T13 = T4 * T12 
13                   T14 = A(I+2,J)         1 
14                   T15 = B(I+1,J,3)       3 
15                   T16 = T14 * T15 
16                   T17 = T13 + T16 
17                   T18 = B(I+1,J,1)       4 
18                   T19 = T17 + T18 
19                   C(I+1,J) = T19         5 
20                   T22 = B(I+2,J,2)       2 
21                   T23 = T14 * T22 
22                   T24 = A(I+3,J)         1 
23                   T25 = B(I+2,J,3)       3 
24                   T26 = T24 * T25 
25                   T27 = T23 + T26 
26                   T28 = B(I+2,J,1)       4 
27                   T29 = T27 + T28 
28                   C(I+2,J) = T29         5 
29                   T32 = B(I+3,J,2)       2 
30                   T33 = T24 * T32 
31                   T34 = A(I+4,J)         1 
32                   T35 = B(I+3,J,3)       3 
33                   T36 = T34 * T35 
34                   T37 = T33 + T36 
35                   T38 = B(I+3,J,1)       4 
36                   T39 = T37 + T38 
37                   C(I+3,J) = T39         5 
38                   I = I + 4 
39                   if I <= 252 goto loop 



FIG. 23 

Data Dependency Graph 

[Figure: data dependency graph 24 for statements 1 to 37 of the intermediate code of FIG. 21. Each node gives the statement number and the three-address statement, followed by an annotation. Memory-reference nodes carry a pair, for example "9,1" for statement 1 (T1=A(I,J)), "9,2" for statement 2 (T2=B(I,J,2)), "5,4" for statement 8 (T8=B(I,J,1)), and "1,5" for statement 10 (C(I,J)=T9). Arithmetic nodes carry a single number, for example 7 for statement 3 (T3=T1*T2), 5 for statement 7 (T7=T3+T6), and 3 for statement 18 (T19=T17+T18). The edges and most edge labels are not recoverable from the extraction.] 



FIG. 24A and FIG. 24B (together FIG. 24) 

Schedule Table 26 

Time slot    Statement     Three address          Reference group 
value (303)  number (305)  statement (320)        number (330) 
 1            1            T1 = A(I,J)            1 
 2            4            T4 = A(I+1,J)          1 
 3            5            T5 = B(I,J,3)          3 
 4           13            T14 = A(I+2,J)         1 
 5           14            T15 = B(I+1,J,3)       3 
 6           22            T24 = A(I+3,J)         1 
 7           23            T25 = B(I+2,J,3)       3 
 8           31            T34 = A(I+4,J)         1 
 9           32            T35 = B(I+3,J,3)       3 
10            2            T2 = B(I,J,2)          2 
11           11            T12 = B(I+1,J,2)       2 
12           20            T22 = B(I+2,J,2)       2 
13           29            T32 = B(I+3,J,2)       2 
14            3            T3 = T1 * T2 
15            6            T6 = T4 * T5 
16           12            T13 = T4 * T12 
17           15            T16 = T14 * T15 
18           21            T23 = T14 * T22 
19           24            T26 = T24 * T25 
20           30            T33 = T24 * T32 
21           33            T36 = T34 * T35 
22            7            T7 = T3 + T6 
23            8            T8 = B(I,J,1)          4 
24           16            T17 = T13 + T16 
25           17            T18 = B(I+1,J,1)       4 
26           25            T27 = T23 + T26 
27           26            T28 = B(I+2,J,1)       4 
28           34            T37 = T33 + T36 
29           35            T38 = B(I+3,J,1)       4 
30            9            T9 = T7 + T8 
31           18            T19 = T17 + T18 
32           27            T29 = T27 + T28 
33           36            T39 = T37 + T38 
34           10            C(I,J) = T9            5 
35           19            C(I+1,J) = T19         5 
36           28            C(I+2,J) = T29         5 
37           37            C(I+3,J) = T39         5 
38           38            I = I + 4 
39           39            if I <= 252 goto loop 



FIG. 25A and FIG. 25B (together FIG. 25) 

Intermediate Code 5 

Statement  Label  Three address          Reference group 
number     (310)  statement (320)        number (330) 
 1         loop:  T1 = A(I,J)            1 
 2                T4 = A(I+1,J)          1 
 3                T5 = B(I,J,3)          3 
 4                T14 = A(I+2,J)         1 
 5                T15 = B(I+1,J,3)       3 
 6                T24 = A(I+3,J)         1 
 7                T25 = B(I+2,J,3)       3 
 8                T34 = A(I+4,J)         1 
 9                T35 = B(I+3,J,3)       3 
10                T2 = B(I,J,2)          2 
11                T12 = B(I+1,J,2)       2 
12                T22 = B(I+2,J,2)       2 
13                T32 = B(I+3,J,2)       2 
14                T3 = T1 * T2 
15                T6 = T4 * T5 
16                T13 = T4 * T12 
17                T16 = T14 * T15 
18                T23 = T14 * T22 
19                T26 = T24 * T25 
20                T33 = T24 * T32 
21                T36 = T34 * T35 
22                T7 = T3 + T6 
23                T8 = B(I,J,1)          4 
24                T17 = T13 + T16 
25                T18 = B(I+1,J,1)       4 
26                T27 = T23 + T26 
27                T28 = B(I+2,J,1)       4 
28                T37 = T33 + T36 
29                T38 = B(I+3,J,1)       4 
30                T9 = T7 + T8 
31                T19 = T17 + T18 
32                T29 = T27 + T28 
33                T39 = T37 + T38 
34                C(I,J) = T9            5 
35                C(I+1,J) = T19         5 
36                C(I+2,J) = T29         5 
37                C(I+3,J) = T39         5 
38                I = I + 4 
39                if I <= 252 goto loop 



FIG. 26A and FIG. 26B (together FIG. 26) 

Object Program 

Instruction  Label  Instruction  Operand 
number (61)  (62)   (63)         (64) 
 1           loop:  fload        A(r1,J),fr1 
 2                  fload        A(r1+1,J),fr2 
 3                  fload        B(r1,J,3),fr3 
 4                  fload        A(r1+2,J),fr4 
 5                  fload        B(r1+1,J,3),fr5 
 6                  fload        A(r1+3,J),fr6 
 7                  fload        B(r1+2,J,3),fr7 
 8                  fload        A(r1+4,J),fr8 
 9                  fload        B(r1+3,J,3),fr9 
10                  fload        B(r1,J,2),fr10 
11                  fload        B(r1+1,J,2),fr11 
12                  fload        B(r1+2,J,2),fr12 
13                  fload        B(r1+3,J,2),fr13 
14                  fmpy         fr1,fr10,fr10 
15                  fmpy         fr2,fr3,fr3 
16                  fmpy         fr2,fr11,fr11 
17                  fmpy         fr4,fr5,fr5 
18                  fmpy         fr4,fr12,fr12 
19                  fmpy         fr6,fr7,fr7 
20                  fmpy         fr6,fr13,fr13 
21                  fmpy         fr8,fr9,fr9 
22                  fadd         fr10,fr3,fr3 
23                  fload        B(r1,J,1),fr14 
24                  fadd         fr11,fr5,fr5 
25                  fload        B(r1+1,J,1),fr15 
26                  fadd         fr12,fr7,fr7 
27                  fload        B(r1+2,J,1),fr16 
28                  fadd         fr13,fr9,fr9 
29                  fload        B(r1+3,J,1),fr17 
30                  fadd         fr3,fr14,fr14 
31                  fadd         fr5,fr15,fr15 
32                  fadd         fr7,fr16,fr16 
33                  fadd         fr9,fr17,fr17 
34                  fstore       fr14,C(r1,J) 
35                  fstore       fr15,C(r1+1,J) 
36                  fstore       fr16,C(r1+2,J) 
37                  fstore       fr17,C(r1+3,J) 
38                  add          r1,#4,r1 
39                  cmp          r1,#252 
40                  ble          loop 



FIG. 27 

I        Instruction  Referenced     Hit     Accepted    Rejected    Cache miss caused 
value    number       array element  status  block       block       by cache conflict 
I = 1     1           A(1,J)         Miss    A(1:4,J) 
to        2           A(2,J)         Hit 
I = 4     3           B(1,J,3)       Miss    B(1:4,J,3) 
          4           A(3,J)         Hit 
          5           B(2,J,3)       Hit 
          6           A(4,J)         Hit 
          7           B(3,J,3)       Hit 
          8           A(5,J)         Miss    A(5:8,J) 
          9           B(4,J,3)       Hit 
         10           B(1,J,2)       Miss    B(1:4,J,2)  A(1:4,J) 
         11           B(2,J,2)       Hit 
         12           B(3,J,2)       Hit 
         13           B(4,J,2)       Hit 
         23           B(1,J,1)       Miss    B(1:4,J,1)  B(1:4,J,3) 
         25           B(2,J,1)       Hit 
         27           B(3,J,1)       Hit 
         29           B(4,J,1)       Hit 
         34           C(1,J)         Miss    C(1:4,J)    B(1:4,J,2) 
         35           C(2,J)         Hit 
         36           C(3,J)         Hit 
         37           C(4,J)         Hit 



FIG. 28 

[Figure: no text recoverable from the extraction.] 

FIG. 29 

[Figure: block diagram (numeral 280) of a CPU containing an ALU 291 and cache memory 110, together with main memory 120.] 

COMPILE METHOD FOR REDUCING 
CACHE CONFLICT 

CROSS REFERENCE TO RELATED 
APPLICATION 

This application is a continuation of copending patent 
application Ser. No. 08/303,007, filed Sep. 8, 1994, now 
abandoned, entitled "Compile Method for Reducing Cache 
Conflict." 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention concerns a method of compiling programs 
which are executed on computers having cache memory, and 
more specifically, such a method which generates code that 
reduces, by as much as possible, cache conflicts which 
would otherwise occur due to conflicting cache access 
during execution of these programs. 

2. Description of the Prior Art 

Cache memory and cache conflict are described in works 
such as J. Hennessy and D. Patterson, Computer Architecture: 
A Quantitative Approach (©1990 Morgan Kaufmann 
Publishers, Inc., Palo Alto, Calif.), pages 408-425. 

Cache memory (referred to hereafter simply as "cache") 
is a type of rapidly accessible memory which is used 
between a processing device and main memory. By placing 
a copy of one portion of data residing in main memory into 
cache, data referencing speed, i.e., the speed at which this 
data is accessed, can be increased. The units of data trans- 
ferred between cache and main memory are called "blocks"; 
blocks in cache are called "cache blocks", while blocks in 
main memory are called "memory blocks". 

As to methods of implementing caches, there are three 
mapping methods which place copies of memory blocks into 
cache blocks: the full associative method, the set associative 
method, and the direct map method. Recently, to provide 
caches with enhanced capacity and increased speed, the 
direct map method is coming into mainstream use. 

With the direct map method, the cache blocks, into which 
memory blocks are mapped, are uniquely determined. Since 
cache capacity is generally smaller than main memory 
capacity, multiple memory blocks are mapped into one 
cache block. Because of this, even if a piece of data has been 
transferred to a cache block, if the data that was previously 
written into that cache block is now rejected from the cache 
block due to a reference of another memory block now 
mapped in the same cache block, a cache miss will be 
generated with the next reference. 

This phenomenon is called "cache conflict"; cache misses 
generated by this phenomenon are called "cache misses 
caused by cache conflict". One drawback of the direct map 
method is that with certain programs, a substantial amount 
of cache conflict is generated, causing a marked drop in 
performance. 

FIGS. 2 through 7 show cache conflict in detail. Hereafter, 
and for purposes of illustration and simplicity, I will assume 
the use of the direct map method with a block length of 16 
bytes and a cache capacity of 256Kbytes (256x1024 bytes). 
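
The direct map method under these assumed parameters can be sketched as follows. This is only an illustration in Python: the modulo placement rule, the zero-based byte addresses, and the function names are assumptions made for the sketch, since the text states only that the cache block for each memory block is uniquely determined.

```python
BLOCK_BYTES = 16                  # block length assumed in the text
CACHE_BYTES = 256 * 1024          # cache capacity assumed in the text
NUM_CACHE_BLOCKS = CACHE_BYTES // BLOCK_BYTES   # 16384 cache blocks


def memory_block(address: int) -> int:
    """Number of the memory block that contains a byte address."""
    return address // BLOCK_BYTES


def cache_block(address: int) -> int:
    """Direct mapping (assumed modulo rule): one possible cache block per memory block."""
    return memory_block(address) % NUM_CACHE_BLOCKS


def may_conflict(addr_a: int, addr_b: int) -> bool:
    """True when two addresses lie in different memory blocks that share a cache block."""
    return (memory_block(addr_a) != memory_block(addr_b)
            and cache_block(addr_a) == cache_block(addr_b))


if __name__ == "__main__":
    # Addresses exactly one cache capacity apart compete for the same cache block.
    print(may_conflict(0, CACHE_BYTES))
    # Addresses inside the same 16-byte block never conflict with each other.
    print(may_conflict(0, 8))
```

Under this rule, any two memory blocks whose addresses differ by a multiple of the cache capacity compete for the same cache block, which is the situation the following figures illustrate.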

FIG. 2 shows a source program; FIG. 3 shows an over- 
view of the status of mapping from main memory to cache; 
FIGS. 4 and 5 show, in increased detail over that in FIG. 3, 
the status of mapping from main memory to cache; FIG. 6 
shows an example of an object program for the source 
program of FIG. 2; and FIG. 7 shows the generation of a 
cache conflict. 

First, FIG. 2 shows an illustrative program, here a source 
program, in which multiple cache conflicts are generated. In 
this source program 3, with a common declaration, arrays A, 
B, and C are allocated consecutively in the memory area in 
this order. Arrays A and C are two dimensional arrays; array 
B is a three dimensional array. Each array is of real type 
(floating point numbers) with an element length of 4 bytes. The 
declared size for array A is (256, 128), for array B is (256, 
128, 3), and for array C is (256, 128). The execution portion of 
program 3 is a doubly nested loop. For the innermost loop, 
for I=1 to 255, the C(I,J) value is calculated using the values 
A(I,J), B(I,J,2), A(I+1,J), B(I,J,3), and B(I,J,1). 

FIG. 3 shows a part of each array residing within main
memory 120 and its corresponding part of cache 110, and the
correspondence relationship itself. The size of each area in
the memory that contains the parts of arrays, specifically
A(1:256,1:128), B(1:256,1:128,1), B(1:256,1:128,2),
B(1:256,1:128,3), and C(1:256,1:128), is 128Kbytes
(=256*128*4). Accordingly, arrays A(1:256,1:128),
B(1:256,1:128,2), and C(1:256,1:128) are mapped in one
common area in cache 110, while arrays B(1:256,1:128,1)
and B(1:256,1:128,3) are also mapped in another common
area in cache 110. The above mentioned expressions "m:n"
show the range of subscripts from lower bound m to upper
bound n.
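
The arithmetic behind this overlap can be checked with a minimal sketch. It assumes, for illustration only, that array A starts at a main memory address that is a multiple of the cache capacity; the names `PLANES` and `cache_offsets` are hypothetical.

```python
ELEM_BYTES = 4                          # real type, 4 bytes per element
CACHE_BYTES = 256 * 1024                # cache capacity assumed in the text
PLANE_BYTES = 256 * 128 * ELEM_BYTES    # one 256 x 128 plane = 128 Kbytes

# Planes in the order the common declaration lays them out in memory.
PLANES = ["A", "B(:,:,1)", "B(:,:,2)", "B(:,:,3)", "C"]


def cache_offsets():
    """Map each plane to the cache offset (in bytes) where it begins."""
    return {name: (n * PLANE_BYTES) % CACHE_BYTES
            for n, name in enumerate(PLANES)}


offsets = cache_offsets()
for name, off in offsets.items():
    print(f"{name:9s} begins at cache offset {off // 1024} Kbytes")
```

Since each plane is exactly half the cache capacity, planes alternate between the two halves of the cache, which is why A, B(:,:,2), and C share one area while B(:,:,1) and B(:,:,3) share the other.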

FIGS. 4 and 5 collectively show, in detail, the structure of 
cache 110 as well as its mapping with main memory 120. 

As shown, cache 110 is a collection of unitized 16 byte 
cache blocks, illustratively cache blocks 111 to 116. Main 
memory 120 is also shown with unitized memory blocks.
Mapping from main memory 120 to cache 110 is shown
with arrows. 

For example, in FIG. 4, three memory blocks A(1:4,J), 
B(1:4,J,2), and C(1:4,J) are mapped into one common cache
block 111. However, only one of these three memory blocks 
can reside in cache block 111 at any one time. Because of 
this, cache conflict can arise where the data is referenced in 
these memory blocks through the cache. Similarly, cache 
conflict also occurs in cache blocks 112 and 113. 

Since, as shown in FIG. 5, two memory blocks belonging 
to the same array (B) are mapped into each one of three 
common cache blocks, cache conflict can be generated here 
as well. Consequently, as shown in this figure, cache conflict 
is also generated for accesses within the same array (B).

FIG. 6 is an example of an object program that corre-
sponds to source program 3 shown in FIG. 2. However, FIG.
6 only shows the object code for the innermost loop in FIG.
2. Since generally during execution of the program, the
greater portion of the processing time is spent executing the
innermost loop, the remaining portions, at least for this 
discussion, are not very important and, for simplicity, will be 
ignored. 

In FIG. 6, each instruction is identified by a unique 
instruction number 61 (specifically numerals 1 to 13) added
to the left side of the figure. The label 62 is used as a branch
target for branch instructions. Processing contents 65 of each
instruction 63 for operand 64 are shown all the way to the
right in this figure. Of these instructions, instructions 1, 2, 4,
5, 8 and 10 are memory reference instructions, inasmuch as
they reference array elements A(I,J), B(I,J,2), A(I+1,J),
B(I,J,3), B(I,J,1), and C(I,J) respectively.

In FIG. 7, the cache conflicts caused by memory reference 
instructions in the object program, shown in FIG. 6, are 
depicted in the order in which these instructions are
executed. Each line shows, from the left: a value 70 (I value) 
for innermost loop control variable I; instruction number 71
with each different instruction carrying its specific instruc- 
tion number from FIG. 6; instruction referenced array ele- 
ment 72; cache hit status 73 (shows whether or not refer- 
enced data exists in cache); accepted block 74, i.e., the 
memory block which was transferred from main memory 
120 and placed in cache 110; rejected block 75, i.e., the 
memory block which was rejected from cache 110, and 
cache miss 76 which shows whether this miss was caused by 
a cache conflict. 

As can be seen from FIG. 7, cache misses occur with all 
of the memory references, with most of these cache misses 
caused by cache conflict. For example, instructions 1 and 4 
for I=1 both reference the same memory block A(1:4,J), but
because memory block B(1:4,J,2), which is mapped to the
same cache block as block A(1:4,J), is referenced in between
by instruction 2, block A(1:4,J) is rejected from cache.
This causes a cache miss to occur with instruction 4. 

Of the cache misses shown in FIG. 7, the cache misses 
generated by referencing the same memory block twice are 
all cache misses caused by cache conflict. Of the 24 cache 
misses from I=1 to I=4, 18 of them are cache misses caused
by cache conflict. Thus, it is clear that cache conflict is a 
major reason for cache misses. 
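
These counts can be reproduced with a small simulation of a direct map cache. The sketch below is illustrative only and makes several assumptions not stated in the text: array A begins at address 0, the store to C(I,J) allocates a cache block like a load does, and a miss is counted as "caused by cache conflict" when the memory block had already been referenced earlier. The function names are hypothetical.

```python
ELEM = 4                        # bytes per element
BLOCK = 16                      # bytes per cache block
NBLOCKS = (256 * 1024) // BLOCK # number of cache blocks
PLANE = 256 * 128               # elements in one 256 x 128 plane

A_BASE = 0                      # assumed start of array A
B_BASE = A_BASE + PLANE * ELEM
C_BASE = B_BASE + 3 * PLANE * ELEM


def addr(base, i, j, k=1):
    """Byte address of element (i, j, k), column-major, 1-based subscripts."""
    return base + ((k - 1) * PLANE + (j - 1) * 256 + (i - 1)) * ELEM


def simulate(i_values, j=1):
    """Return (total misses, misses on previously referenced blocks)."""
    cache = {}          # cache block index -> resident memory block
    seen = set()        # memory blocks referenced so far
    misses = conflict = 0
    for i in i_values:
        refs = [addr(A_BASE, i, j), addr(B_BASE, i, j, 2),
                addr(A_BASE, i + 1, j), addr(B_BASE, i, j, 3),
                addr(B_BASE, i, j, 1), addr(C_BASE, i, j)]
        for a in refs:
            mem_block = a // BLOCK
            index = mem_block % NBLOCKS
            if cache.get(index) != mem_block:
                misses += 1
                if mem_block in seen:
                    conflict += 1
                cache[index] = mem_block
            seen.add(mem_block)
    return misses, conflict


print(simulate(range(1, 5)))
```

Under these assumptions, running the loop body for I=1 to I=4 yields 24 misses, 18 of which fall on memory blocks that had already been referenced.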

One method of reducing cache conflict is using the full 
associative method or the set associative method as shown 
in the above prior art. However, with either of these 
methods, the implementing caching hardware becomes com- 
plex and cache referencing speed is reduced. Furthermore, 
with each of these methods, it is difficult to significantly 
increase the capacity of the cache. 

Moreover, with set associative method based cache, when 
the number of cache block candidates (associativity) to 
which the memory blocks are mapped is small, it is not 
possible to sufficiently avoid cache conflict. Consequently, 
for this type of cache, even when the associativity is small, 
use of a cache conflict reduction method is still quite 
necessary. 

With the direct map method, cache conflict may be 
reduced by simply increasing the capacity of cache memory. 
However, when compared with the cost of main memory of 
the same capacity, cache memory costs significantly more 
than main memory, hence limits exist to increasing cache 
capacity. 

As shown above, because the structure of direct map 
method cache is simple, this method has the advantages of 
providing enhanced access speed and ease of expanding 
cache capacity. However, with certain programs, disadvan-
tageously, cache conflict occurs and performance is greatly
lessened. Until the present invention, as described below, a 
sufficient way did not exist to reduce cache conflict in direct 
map method cache. 

Even with set associative method based cache which 
generates relatively little cache conflict, nevertheless when 
associativity is small, cache conflict occurs and performance 
is greatly lessened. If the associativity is sufficiently 
increased, all of the cache conflict can be avoided, but at the 
cost of complicating the ensuing caching hardware structure, 
lowering cache referencing speed and making any signifi- 
cant increase in cache capacity quite difficult. Based on these 
reasons, the associativity of the set associative method based 
cache currently available on the commercial market is 
approximately 2 to 4, which, for programs that might 
generate an appreciable amount of cache conflict, is 
inadequate, in practice, to sufficiently prevent the cache 
conflict from occurring. 

SUMMARY OF THE INVENTION 

The purpose of this invention is to provide a method for 
reducing cache conflict for programs that are otherwise 
susceptible to a marked decrease in performance due to 
cache conflict. 




To achieve this purpose in computers with cache memory, 
the present invention detects, during program compilation 
into object code, the generation of cache conflict, which 
occurs between memory references in the input program. 
The invention also performs cache conflict analysis to
determine which specific memory references cause the 
cache conflict. 

Based on the results of this analysis, when a cache conflict 
is detected and before all references would have been
completed to a memory block previously transferred to a
cache block, i.e., before all those references would be 
embodied in compiled object code ready to be executed, the 
memory references in the compiled object program are 
reordered such that the memory block would now not be 
rejected from cache. 

The reordering of memory references is done on the
compiler intermediate code or on the object program gen- 
erated by the compiler. Alternatively, memory references on 
the compiler intermediate code can be reordered followed by 
a further reordering of the appropriate references on the
object program generated by the compiler. Furthermore, the
memory references on the program intermediate code can be 
reordered with resulting reordered intermediate code then 
being reverse converted (reverse compiled) back to a new 
source program version. This new version should ultimately
generate far fewer, if any, cache conflicts. That version is
then provided as an output source program for subsequent 
compilation and execution either in the computer which 
generated this program or another computer. 

My inventive method for use with computers with cache
memories includes a cache conflict analysis step for detect-
ing cache conflicts related to memory references in input 
source programs that are being compiled into object code. 
This step provides, as output cache conflict information, 
analyzed cache conflict generation conditions. In conjunc-
tion with the analysis step, my inventive method includes a
memory reference reordering step that changes the order of 
memory references in the source program, as it is being 
compiled, and particularly in corresponding intermediate 
code therefor, so that a memory block is not rejected from
cache before all references to memory blocks, previously
transferred to a common cache block, would have been 
completed. 

The above cache conflict analysis step includes a cache 
reference group analysis step in which memory references in
the input program are extracted, and, thereafter, memory
references to the same memory block are grouped as a cache 
reference group. During the latter step, the cache reference 
groups are classified with cache reference group information 
being then generated as cache conflict information. After the
cache conflict information is generated, a cache conflict
graph analysis step is performed through which, based on 
the cache reference group information, a graphical output 
depiction of cache conflict information is provided which 
shows the cache conflict conditions among different cache
reference groups.

When address distances, in main memory, among the 
memory references in the source program are below a 
designated value, the cache reference group analysis step 
concludes that these memory references are referencing the
same memory block, and assigns these references to the
same cache reference group. 

The cache reference group analysis step obtains memory 
blocks referenced to positions (addresses) in main memory 
of the memory references in the input program, and, for
those memory references which refer to a common memory
block, assigns those references to the same cache reference 
group. 
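
A minimal sketch of the second grouping rule (references to a common memory block form one cache reference group) is given below. It assumes the addresses of the references are already known as integers and uses the 16 byte block length of the running example; `group_by_memory_block` is a hypothetical name, and the address-distance variant described just before is not shown.

```python
from collections import defaultdict

BLOCK_BYTES = 16   # block length of the running example


def group_by_memory_block(references):
    """Group (statement_number, address) pairs by referenced memory block.

    Returns a dict mapping group numbers 1, 2, ... (in order of first
    appearance) to the list of statement numbers in that group.
    """
    blocks = defaultdict(list)
    for statement, address in references:
        blocks[address // BLOCK_BYTES].append(statement)
    return {n: stmts for n, stmts in enumerate(blocks.values(), start=1)}


# Instructions 1, 2, 4, 5, 8, 10 for I=1, with array A assumed at address 0.
refs = [(1, 0), (2, 262144), (4, 4), (5, 393216), (8, 131072), (10, 524288)]
groups = group_by_memory_block(refs)
```

Here the references to A(I,J) and A(I+1,J) fall in the same 16 byte block and therefore land in the same group, while every other reference forms a group of its own.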




For arbitrary first and second cache reference groups 
included in the cache reference group information, the above 
cache conflict graph analysis step obtains a minimum value 
cache distance between all of the memory references 
included in the first cache reference group and all of the 
memory references included in the second cache reference 
group. When the minimum value is below the designated 
value, cache conflict is viewed as being generated between 
the memory references of the first cache reference group and 
the memory references of the second cache reference group. 
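
The minimum-distance test can be sketched as follows. The passage does not define "cache distance" precisely, so the sketch assumes it is the circular distance between the positions two addresses occupy within the cache; that definition, the function names, and the threshold used in the example are all assumptions made for illustration.

```python
CACHE_BYTES = 256 * 1024   # cache capacity of the running example


def cache_distance(addr_a, addr_b):
    """Assumed definition: circular distance between cache positions."""
    d = abs(addr_a % CACHE_BYTES - addr_b % CACHE_BYTES)
    return min(d, CACHE_BYTES - d)


def groups_conflict(group_a, group_b, threshold):
    """True when the minimum cache distance between the groups is below threshold."""
    minimum = min(cache_distance(a, b) for a in group_a for b in group_b)
    return minimum < threshold


# Group {A(1,J), A(2,J)} against B(1,J,2) and against B(1,J,1), A assumed at 0.
near = groups_conflict([0, 4], [262144], threshold=16)
far = groups_conflict([0, 4], [131072], threshold=16)
```

With a threshold equal to the block length, the A group conflicts with B(1,J,2), which lies one full cache capacity away, but not with B(1,J,1), which lies half a cache capacity away.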

Furthermore, for the arbitrary first and second cache 
reference groups, the cache conflict graph analysis step also 
obtains the cache blocks mapped from the main memory 
position of all the memory references included in the first 
cache reference group, and likewise obtains the cache blocks 
mapped from the main memory position of all the memory 
references included in the second cache reference group. 
When there are memory references mapped to the same 
cache block, this step concludes that cache conflict is 
generated between the memory references of the first cache 
reference group and the memory references of the second 
cache reference group. 
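
The same-cache-block test leads directly to a conflict graph whose nodes are group numbers. The following is a sketch under the running example's parameters; the group contents are given as lists of byte addresses, and `conflict_edges` is a hypothetical name.

```python
from itertools import combinations

BLOCK_BYTES = 16
NUM_CACHE_BLOCKS = (256 * 1024) // BLOCK_BYTES


def cache_blocks(addresses):
    """Set of cache block indices that the given addresses map to."""
    return {(a // BLOCK_BYTES) % NUM_CACHE_BLOCKS for a in addresses}


def conflict_edges(groups):
    """Return edges (g1, g2) for every pair of groups sharing a cache block.

    groups: dict mapping group number -> list of byte addresses.
    """
    edges = set()
    for (g1, a1), (g2, a2) in combinations(sorted(groups.items()), 2):
        if cache_blocks(a1) & cache_blocks(a2):
            edges.add((g1, g2))
    return edges


# Groups for I=1: A(1:2,J), B(1,J,2), B(1,J,3), B(1,J,1), C(1,J); A assumed at 0.
example = {1: [0, 4], 2: [262144], 3: [393216], 4: [131072], 5: [524288]}
edges = conflict_edges(example)
```

For this example the graph has edges among the groups for A, B(I,J,2), and C, and a separate edge between the groups for B(I,J,3) and B(I,J,1).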

After performing loop unrolling for a loop in the input 
program, the memory references reordering step, described 
above, changes the order of instructions to coincide with the 
order of the memory references. In particular, the loop that 
is to be unrolled and the loop unrolling count therefor are 
both selected such that the loop has a memory reference 
range length of approximately the same length, once the 
loop is unrolled, as the cache blocks. 
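
One simple way to read this selection rule is that the unrolled loop body should sweep about one cache block of each array. The sketch below illustrates only that reading; the function name and the stride parameter are hypothetical, and the detailed selection procedure is the one described with FIG. 14.

```python
def unroll_count(block_bytes, element_bytes, stride_elements=1):
    """Iterations needed for one unrolled body to span about one cache block."""
    bytes_per_iteration = element_bytes * stride_elements
    return max(1, block_bytes // bytes_per_iteration)


# Running example: 16 byte blocks, 4 byte elements, unit stride in I.
count = unroll_count(16, 4)
```

For the running example this gives an unrolling count of 4, so that one unrolled iteration covers elements I through I+3 of each array, i.e., one memory block.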

As a result of using my inventive method with each of 
those programs which would otherwise exhibit cache con- 
flicts and hence lowered performance, a corresponding 
object program or associated source program is generated 
that manifests fewer, if any, cache conflicts. 

By detecting when cache conflicts would be generated in 
the input program, then, in accordance with my present 
invention, specific processing can then be performed on that 
program which mainly focuses on eliminating these cache 
conflicts. Generally, when a cache conflict occurs, a marked 
decrease in performance results. Therefore, cache conflict 
reduction should be made a priority, for specified items, 
through detection of cache conflict generation areas. 

By doing cache conflict analysis, with the concomitant 
reordering of memory references, those memory references 
which would otherwise cause each cache conflict can be 
determined. 

Moreover, by reordering these memory references, cache 
conflict, in those memory areas where cache conflict would 
otherwise occur, can be avoided. For example, in order for 
a memory block not to be rejected from cache before all 
references, in that input program, to memory blocks previ- 
ously transferred to a cache block would have been 
completed, cache misses caused by cache conflict can be 
completely and advantageously avoided, as I now teach, by 
changing the order of the memory references in the program. 

By reordering the memory references for the intermediate 
code, redundant data dependency, heretofore caused by 
allocating a small number of registers to memory reference 
instructions, is eliminated. This, in turn, advantageously 
increases the degree of freedom for changing the order of 
memory references. Because of this, memory references can 
be moved to optimal positions with a concomitant improve- 
ment in the amount of cache conflict reduction that results. 

Advantageously, by reordering the memory references for 
the object program, all instructions including those which
could not be foreseen at the intermediate code stage could be 
reordered. Hence, a very accurate instruction reordering 
could be accomplished to further reduce cache conflicts. 
Furthermore, by reordering the memory references on
both the program intermediate code and the object program,
an optimal reduction in cache conflict could be obtained. 

Additionally, by reordering the memory references for the 
program intermediate code, then reverse compiling the 
results to a source program with this latter program being
supplied as an output source program for later use, any
subsequent compiler operating on the output source program 
will, a priori, then generate code having reduced cache 
conflicts. Thus, through use of my inventive method, on one 
computer to generate source code with reduced cache con-
flicts and intended for eventual widespread distribution and
execution on other computers, cache conflict reduction can 
be employed on a widespread basis thus increasing and 
possibly maximizing its general use. 

By unrolling loops for reordering memory references,
cache conflicts for multiple loop iterations can be
eliminated, thereby further improving cache conflict reduc- 
tion. In particular, by appropriately selecting the particular 
loops to be unrolled and the number of times each such loop 
is to be unrolled, the fewest loops will be unrolled consistent
with highly effective loop unrolling. This, in turn, advanta-
geously avoids having insufficient registers, and its attendant
difficulties, which would otherwise occur whenever an 
excessive number of loops is to be unrolled. 

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example of a compiler
using the compiling method of this invention; 

FIG. 2 shows an example of conventional source program 
3 that can generate cache conflicts;

FIG. 3 shows an overview of conventional main memory
mapping to cache for illustrative program 3 depicted in FIG. 
2; 

FIGS. 4 and 5 collectively show, in detail, the conven- 
tional main memory mapping to cache depicted in simplified 
form in FIG. 3;

FIG. 6 shows a corresponding conventional object pro- 
gram for an inner-loop portion of source program 3 depicted 
in FIG. 2; 

FIG. 7 shows conventional cache conflict generation that 
would occur for the object program shown in FIG. 6;

FIG. 8 is a high-level flowchart of cache conflict analysis 
process 30 shown in FIG. 1; 

FIG. 9 shows the proper orientation of the drawing sheets 
for FIGS. 9A and 9B;

FIGS. 9A and 9B collectively show a detailed flowchart
of cache reference group analysis step 32 shown in FIG. 8; 

FIG. 10 is a detailed flowchart of distance calculation step 
3218 executed within cache reference group analysis step 32 
shown in FIGS. 9A and 9B;

FIG. 11 is a detailed flowchart of cache conflict graph 
analysis step 34 shown in FIG. 8; 

FIG. 12 is a detailed flowchart of distance calculation step 
3408 executed within cache conflict graph analysis step 34 
shown in FIG. 11;

FIG. 13 is a high-level, summarized, flowchart of memory 
reference reordering process 50 shown in FIG. 1; 

FIG. 14 is a detailed flowchart of loop unrolling method 
determination step 52 shown in FIG. 13;

FIG. 15 is a detailed flowchart of cache conflict elimina-
tion by intermediate code reordering process 57 shown in
FIG. 13; 




FIG. 16 shows illustrative intermediate code, for a portion may be simultaneously referred to during the course of the 

of source program 3 shown in FIG. 2, and prior to processing following description. 

that intermediate code to reduce cache conflicts therein; piG. 1 shows the structure of an example of a compiler 

FIG. 17 shows the general structure of cache conflict us j n g the compile method of the present invention. FIG. 28 

information, e.g., cache conflict information 9 shown in 5 shows the structure of a computer system for implementing 

FIG. 1; this invention. As shown in FIGS. 1 and 28, compiler 1 

FIG. 18 shows cache conflict information that results resides in memory 284 of central processing unit (CPU) 280, 

from use of my inventive method for exemplary intermedi- the pr0 gram (and accompanying data) to be compiled 

ate code 5 for source program 3 shown in FIGS. 16 and 2, resides in ^ 283) and thc display of analysis results and 

respectively; 10 user instructions is presented through input/output (I/O) 

HG. 19 shows the illustrative intermediate code shown in devicc 281 FIG 29 shows the structurc of CPU 28 0 which 

FIG. 16 but after the cache reference group numbers for fe executed 5 a ram by my inventive method, 

cache conflicts have been written into this code; within cpu ^ ^ m UQ (hereafter referred t0 

L FIG V 2 °^o W ^ t ? e P i°?ni oneDtatlon of lhe drawm S simply as "cache") is installed between main memory 120 

sheets for FIGS. 20A and 20B; 15 and arithmetic and logic unit (ALU) 291, the latter executing 

FIGS 20A and 20B collectively show the illustrative ^ stQred k main memory UQ ^ cpTJ alsQ 

intermediate code shown in FIG. 19 but after both loop holds g ion of (he Q and accornpanyi ng data, all 

unrollmg and vanable renaming have been completed £ ^ m mg . q me ^ 

therein through execution of steps 53 and 54, respectively, w ™« , * c -,- 

shown in FIG 13* ^ow specifically refernng to FIG. 1 and to facilitate 

FIG. 21 shows' the proper orientation of the drawing 2 ° understanding, I will now summarize the processing under- 

sheets for FIGS. 21A and 21B; ^ aken b y this Preferred embodiment as it relates to my 

FIGS. 21A and 21B collectively show intermediate code invention, 

in FIGS. 20A and 20B but after common array elements Compiler 1 accepts source program 3, as input, and 

have been eliminated from this code through execution of ultimately generates object program 7. Compiler 1 performs 

step 55 shown in FIG. 13; 25 the following processes in order: syntax analysis process 10, 

FIGS. 22Aand 22B respectively show the structure of a cache conflict reduction process 20, and code generation 

loop unrolling candidate table in general, and, in particular, process 60. With widely known technology, syntax analysis 

a loop unrolling candidate table for illustrative intermediate process 10 converts source program 3 into intermediate code 

code 5 as it undergoes loop unrolling; 5 for subsequent use by the compiler. Cache conflict reduc- 

FIG. 23 shows a data dependency graph for intermediate 30 tion process 20 is performed on intermediate code 5. Code 

code 5 shown in FIGS. 21A and 21 B; generation process 60 generates object program 7 from 

FIG. 24 shows the proper orientation of the drawing intermediate code 5 but as altered by cache conflict reduc- 

sheets for FIGS. 24A and 24B; tion process 20. 

FIGS. 24A and 24B collectively show an illustrative ^ Consequently, as readily seen, object program 7 is gen- 
schedule table to which statements in intermediate code 5 e rated for input source program 3 but upon which cache 
shown in FIGS. 21Aand 21B have been allocated according conflict reduction process 20 was performed, 
to data dependency graph 24 shown in FIG. 23; Nextj j ^ explain the specific processing undertaken by 

FIG. 25 shows the proper orientation of the drawing cache conflict reduction process 20 which is characteristic of 

sheets for FIGS. 25A and 25 B; 4Q my invention. This processing contains three basic steps, 

FIGS. 25 A and 25B collectively show intermediate code with the processing composed of these three steps used on 

5 but after complete execution of my inventive cache the intermediate code of the loops included within interme- 

conflict reduction method on the intermediate code shown in diate code 5. For purposes of simplifying the ensuing 

FIGS. 21A and 2 IB; discussion, the following example limits the processing to 

FIG. 26 shows the proper orientation of the drawing 45 the intermediate code of innermost loops. Of course, it is 

sheets for FIGS. 26A and 26B; easy to process other intermediate code through my inven- 

FIGS, 26A and 26B collectively show a resulting object live method. Nevertheless, since most of the execution time 
program after execution of my inventive cache conflict for an object program is consumed in executing the inner- 
reduction method and specifically that which results after most loops in that program, then, in almost all cases, 
object level compilation has been performed on intermediate 50 processing only these innermost loops through my inventive 
code 5 shown in FIGS. 25A and 25B; method is sufficient. Hence, I will explain the processing 

FIG. 27 shows the cache conflict generation that occurs undertaken by the three steps of cache conflict reduction 

for object code 7 shown in FIGS. 26A and 26B; process 20 in the context of the innermost loops. 

FIG. 28 shows the structure of a computer system for In particular, within process 20, conflict analysis process 

implementing my present invention; and 55 30 first detects whether intermediate code 5 would generate 

FIG. 29 shows the structure of the central processing unit cache conflicts and then analyzes the conditions which 

(CPU) in the computer system, depicted in FIG. 28 and used would generate each conflict. The detected instances of 

to execute a program generated by the compile method of cache conflicts, i.e., "cache conflict generation detection 

my present invention results", provided by process 30 are used in the following 

To facilitate reader understanding, identical reference «> stop, i.e., decision step 40. Cache conflict generation con- 
numerals are used to denote identical or similar elements dltlons ™ P™^> b X P^f 3 30 > as cacl * conflict infor- 
mal are common to the figures. matlon 9 for subsequent use by memory reference reorder- 

ing process 50. 

DESCRIPTION OF THE PREFERRED Based on the results of cache conflict analvsis 30> i c > 

EMBODIMENT 65 whether a cache conflict would be generated by intermediate 

I will now describe the preferred embodiment of the code 5, decision step 40 determines whether memory ref- 

invention while referring to the figures, several of which erence reordering process 50 will then be executed. In 



01/29/2004, EAST Version: 1.4.1 



5,862,385 

9 10 

particular, when such a cache conflict would be generated, as ing optimization processing between syntax analysis process 

detected by cache conflict analysis process 30, a cache 10 (as shown in FIG. 1) and cache conflict reduction process 

conflict is said to exist. Hence, decision block 40 routes 20, another format can be used for optimization processing, 

execution to process 50 which, in turn, reorders the memory In that regard, a three address statement can be represented 

references to reduce, if not eliminate, these conflicts, s by, e.g., a syntax tree generally used for optimization pro- 

Alternatively, if analysis process 30 finds that no cache cessing; hence, the format of intermediate code 5 can be one 

conflict would be generated within intermediate code 5, consistent with a syntax tree. 

decision block 40 routes execution around memory refer- Cache conflict information 9, as shown in FIG. 1, speci- 

ence reordering process 50, hence skipping any such fi es the cache conflict generation conditions that result from 

reordering, thus effectively terminating further processing of 10 memory references of those statements in intermediate code 

intermediate code 5. 5. Using FIG. 17, 1 will now explain the structure of cache 

Through memory reference reordering process 50, the conflict information 9. 

order through which memory references would be carried As shown in FIG. 17, cache conflict information 9 is 

out is changed based on cache conflict information 9 in order constructed from cache reference group information 9900 

to reduce cache conflict. In essence, when the memory 15 and cache conflict graph 9500. Cache reference group infor- 

reference order is changed, the resulting order is such that all mation 9900 is a summary of memory references to common 

references to a memory block previously transferred to a memory blocks and consists of cache reference groups, only 

cache block are completed before other conflicting memory one of which, i.e., group 9010, is generally shown. These 

references occur to the memory block; thus subsequently groups, in turn, are then classified as described below. Cache 

preventing the memory block from being rejected from 20 conflict graph 9500 shows the cache conflict generation 

cache. conditions between cache reference groups. In other words, 

With the above in mind, I will next describe the structure cache reference group information 9900 is a collection of 

of data input and output for each step within cache conflict cache reference groups, and each cache reference group, 

reduction process 20. Inasmuch as the specific structures of e.g., group 9010, is a collection of statements which perform 

source program 3 and object program 7 are not related to my 25 memory references to the same memory block in interme- 

inventive process 20, these structures are omitted. However, diate code 5. 

the structure of intermediate code 5 and cache conflict Specifically, each cache reference group, such as group 

information 9 will be explained. 9010, is composed of a cache reference group number, e.g., 

Specifically, intermediate code 5 is an internal compiled 3Q number 9020, and for each memory reference statement in 

representation of a source program, the representation being that group: a statement number (9040 denoting all such 

similar to that shown in FIG. 2. In this preferred statement numbers in this group), a referenced array element 

embodiment, a structure known as a three address statement (9060 denoting all such array elements in this group), and an 

is used for intermediate code 5. Three address statements are address calculation expression (9090 denoting all such 

described in A. V. Aho and J. D. Ullman, Compilers (<D1985, expressions in this group) for corresponding referenced 

Addison Wesley Publishing Co.), pages 464-473. Two col- array element 9060. 

umns of data are required for use with cache conflict With the group number of cache reference group 9010 as 

reduction process 20. In that regard, a statement number a node, cache conflict graph 9500 has edges (lines) between 

(305 shown in FIG. 16) column and reference group number each pair of separate cache reference groups which generate 

(330) column are used along with the three address state- 4Q each cache conflict. Having a cache conflict generated 

ments. between two cache reference groups means that the memory 

FIG. 16 shows an example of intermediate code 5. As blocks which reference the memory reference statements in 
shown, intermediate code 5 corresponds to source program each cache conflict group, e.g., group 9010, are mapped into 
3, the latter shown in FIG. 2. As shown in FIG. 16, the the same cache block. As specifically shown in FIG. 17, 
statements in intermediate code 5 are lined up in execution 45 nodes 9510 and 9520 of cache conflict graph 9500 are 
order. Each of these statements is composed of: statement connected by edge 9590; hence, this figure shows that a 
number 305, which starts at 1 and increases incrementally; cache conflict is generated between two cache reference 
label 310 which specifies a branch target; three address groups corresponding to these two nodes, 
statement 320 and, as shown, its contents — all of such steps I have now completed my summary explanation of the 
implementing the innermost loop in source program 3 (see 50 inventive compiler shown in FIG. 1. Accordingly, I will 
FIG. 2); and reference group number 330 shown in FIG. 16 proceed to explain, in detail and in conjunction with a 
which is used as a work area by cache conflict reduction specific example, the processing undertaken by cache con- 
process 20. Label 310 is not used other than as a branch flict analysis process 30, and later in the context of this 
target. Reference group number 330 is added only to example, the ensuing results, and thereafter memory refer- 
memory references; nothing is present in this column at the 5S ence reordering process 50 — both of these steps being 
input to cache conflict reduction process 20. contained within cache conflict reduction process 20 shown 

By using a three address statement as the format of each in FIG. 1. 

statement in intermediate code 5, each memory reference is I will begin by explaining, in detail, cache conflict analy- 

separated into individual statements. As such, it is easy to sis process 30. Through process 30, memory references in 

add such statements or change the order of memory refer- 60 intermediate code 5 such as those shown in FIG. 16 are 

ence statements. Though my inventive method can be imple- analyzed, and resulting cache conflict information 9, as 

mented using other formats, than a three address statement, shown in FIG. 17, is generated. 

for intermediate code 5, accompanying processing does FIG. 8 shows a high-level flowchart of the processing 

become increasingly complex. procedure of cache conflict analysis process 30 shown in 

It is sufficient to convert the source program to appropri- 65 FIG. 1. 

ate three address statements immediately prior to performing As shown in FIG. 8, upon entry into process 30, execution 

cache conflict reduction process 20. Hence, when perform- proceeds to cache reference group analysis step 32. This 



01/29/2004, EAST Version: 1.4.1 



5,862,385 

11 12 

latter step, through analyzing the memory reference statements in
intermediate code 5 (shown in FIG. 16), summarizes references to
memory blocks as cache reference groups (such as group 9010 shown in
FIG. 17), and generates cache reference group information 9900. Next,
cache conflict graph analysis step 34 is executed which, based on
cache reference group information 9900, analyzes cache conflict
conditions between different cache reference groups, and generates
cache conflict graph 9500 in, e.g., graphical form, which represents
the results of this analysis.

Thereafter, decision step 36, when executed, determines whether or
not an edge exists in cache conflict graph 9500, i.e., whether this
graph shows a conflict between two cache reference groups. If such an
edge exists, decision step 36 routes execution, via its true (T)
path, onward to step 38, at which point processing of process 30
ends. Alternatively, if no such edge exists, decision step 36 routes
execution, via its false (F) path, onward to step 39 at which point
no cache conflict is said to exist and processing ends. The result,
if true, from process 30, specifically the fact that an edge exists
in graph 9500, is passed onward to decision step 40 shown in FIG. 1.

Next, using FIGS. 9A-12, I will explain, in detail, the processing
undertaken by cache reference group analysis step 32 and cache
conflict graph analysis step 34 shown in FIG. 8.

FIGS. 9A-9B collectively show a detailed flowchart of cache reference
group analysis step 32; the correct orientation of the drawing sheets
for these figures is shown in FIG. 9. As explained above, through
execution of cache reference group analysis step 32, cache reference
group information 9900 is generated from intermediate code 5.

Upon entry into step 32, as shown in FIGS. 9A-9B, through execution
of step 3202 cache reference group number n is initialized to zero.
Steps 3204 and 3234 represent a "repeat" ("looping") operation.
Processing is done, within step 32, in order from the first statement
to the last statement of intermediate code 5, with the current
statement being processed in code 5 being denoted by variable Si.

Thereafter, decision step 3206, when executed, determines whether
statement Si references memory, i.e., whether that statement is,
among other things, a memory reference statement. If this statement
references memory, execution proceeds, via the true (T) path from
this decision step to decision step 3208. Alternatively, if statement
Si does not reference memory, then this statement is not processed
further by cache reference group analysis step 32 and execution
proceeds, via the false (F) path emanating from decision step 3206,
onward to step 3234. Decision step 3208, when executed, determines
whether statement Si has been registered to (associated with) any
cache reference group, e.g., group 9010. If no such registration has
occurred, decision step 3208 routes execution, via its false path,
forward to step 3210. Alternatively, if statement Si has already been
registered to any cache reference group, then no further processing
occurs within step 32 for this statement and execution proceeds, via
the true path emanating from decision step 3208, onward to step 3234
to process the next statement.

Step 3210, when executed, increments number n by 1 and then sets
variable m to the current n value. Variable m is the initial value of
the cache reference group number for the corresponding cache
reference group that will be processed in the following steps. Cases
exist where the value of variable m changes while processing of a
cache reference group is in progress.

After step 3210 has been executed, step 3212 is executed to establish
cache reference group CAG(m), for which the cache reference group
number is m, and to register statement Si to this cache reference
group, i.e., CAG(m). As shown in FIG. 17, when statement Si is
registered, a statement number (9040), a referenced array element
(9060), and an address calculation expression are written for this
statement to cache reference group information 9900, and specifically
to an entry for cache reference group CAG(m). Because the method for
obtaining the address calculation expression (9090) from the
referenced array element (9060) is widely known, it will be omitted
here.

Next, as shown in FIGS. 9A and 9B, steps 3214 to 3232 are
repetitively performed to process the statement that immediately
follows statement Si, i.e., statement S(i+1), until the last
statement. For ease of reference, the particular statement currently
being processed by these steps is denoted as statement Sj.

Decision step 3216, when executed, determines whether statement Sj
references memory, i.e., whether this statement is, among other
things, a memory reference statement. If statement Sj references
memory, execution is routed by this decision block, via its true
path, onward to step 3218. Alternatively, if statement Sj does not
reference memory, then this statement is not processed further with
execution routed, via a false path emanating from decision block
3216, onward to step 3232.

When executed, step 3218 calculates a distance, d, between memory
reference statement Sj and cache reference group CAG(m). The distance
between memory reference x and cache reference group y is a minimum
value of distance in main memory between memory reference x and all
the memory references registered to cache reference group y. The
calculation method for this distance is explained in detail later
through FIG. 10.

Once step 3218, shown in FIGS. 9A and 9B, is completed, execution
proceeds to decision step 3220, which determines whether distance d
is less than or equal to a designated value Dmax. If distance d is
less than or equal to Dmax, decision step 3220 routes execution, via
its true path, onward to decision step 3222. Alternatively, if
distance d exceeds Dmax, then statement Sj is not processed further
with decision block 3220 routing execution, via its false path, to
step 3232.

Designated value Dmax is a "threshold value" for determining whether
or not memory references refer to the same memory block and is based
on the distance between memory references in main memory. This
threshold value should be set equal to ("the memory block length")/2.
If distance d is less than or equal to the threshold value Dmax, then
both memory references basically refer to the same memory block. As
described below and with reference to this preferred embodiment, it
has been determined that Dmax is preferably equal to 4. However, Dmax
is not so limited as long as the value chosen for Dmax is one which
can be used to determine whether or not two memory references refer
to the same memory block.

Consequently, in the preferred embodiment, decision block 3220
determines whether or not memory reference statement Sj and the
memory reference in cache reference group CAG(m) refer to the same
memory block.

An alternate method for determining whether or not memory references
refer to the same memory block relies on directly obtaining the
referenced memory block from the position in main memory of the
referenced array element. However, here, since the array element
address changes




depending on the loop iteration, one must necessarily and
sufficiently consider that the referenced memory block also changes;
hence, processing becomes somewhat complex. Nevertheless, this
alternate method, if sufficiently expanded in this way, could be used
with the preferred embodiment.

Decision step 3222, when executed, determines whether statement Sj is
registered to any of the existing cache reference groups, e.g., group
9010. If statement Sj is not so registered, then this decision block
routes execution, via its false path, to step 3224. Through execution
of step 3224, processing statement Sj is registered to cache
reference group CAG(m), from which step execution proceeds forward to
step 3232. Alternatively, if decision block 3222 determines that
statement Sj is registered in an existent cache reference group, then
this decision block routes execution, via its true path, to block
3226.

Through execution of steps 3226 to 3230, the memory references in
cache reference group CAG(m) are merged with existing cache reference
groups. First, step 3226, when executed, re-registers all memory
references currently registered in cache reference group CAG(m), into
the one cache reference group, of groups 9010, to which statement Sj
belongs. Next, through execution of step 3228, value n is decremented
by one, and cache reference group CAG(m) is eliminated. Thereafter,
through execution of step 3230, the cache reference group number of
Sj is set to m.

Steps 3232 and 3234 represent ends of repeat (loop) processing; step
3236 represents the end of execution of cache reference group
analysis step 32.

Next, I will describe the processing inherent in step 3218 mentioned
above, specifically, the calculation of the distance d between the
memory references of processing statement Sj and cache reference
group CAG(m).

FIG. 10 shows a detailed flowchart of distance calculation step 3218.
Upon entry into step 3218, execution proceeds to step 3252, which
obtains the address calculation expression AEj for the memory
referenced by statement Sj. The address calculation expression AEj
can be easily obtained from array elements specified by memory
references. Since this is a well known method, I will omit all
detailed explanation thereof.

Thereafter, execution proceeds to step 3254 which sets the initial
value of d equal to Dmax+1. This is followed by repetitive execution
of steps 3256 to 3268. Steps 3256 to 3268 are executed sequentially
for all memory references Ai registered in cache reference group
CAG(m) which is one of the cache reference groups 9010.

Through execution of step 3258, the address calculation expression,
AEi, 9090 of memory reference Ai is extracted from cache reference
group CAG(m). Thereafter, step 3260 is executed to calculate a
difference, diff, between address calculation expressions AEi and
AEj. Inasmuch as step 3260 is implemented with conventional, and
simple symbolic expression processing, I will omit all details here.
Consequently, through the above described processing, the address
difference for the memory references is obtained.

Next, decision step 3262 is executed to determine whether difference
diff is a constant. If this difference is a constant, execution
proceeds, via the true path emanating from this decision step, to
decision step 3264. Alternatively, if difference diff is not a
constant, then this difference is not processed further by steps 3264
to 3268 and execution proceeds, via the false path emanating from
decision step 3262, to step 3272. Execution of step 3272 sets
distance d equal to the current difference value diff, with execution
thereafter proceeding to final step 3270 signifying completion of
distance calculation step 3218.

Decision step 3264, when executed, determines whether an absolute
value, abs, of the difference diff is smaller than distance d. If the
absolute value is smaller in that regard, then decision step 3264
routes execution, via its true path, to step 3266, wherein, when
executed, the absolute value of difference diff is set as the new
value for distance d. Alternatively, if decision step 3264 determines
that the absolute value exceeds or equals distance d, then this
decision step routes execution, via its false path, directly to step
3268, and hence, back to step 3256, thereby skipping step 3266.

Based on the above processing, distance d is set to be a minimum
value of the distance in memory between the memory references of
statement Sj and all memory references in cache reference group
CAG(m). However, when this distance is not a constant, distance d
remains set, through step 3272, as the expression which shows the
distance between two memory references in this group. When the
distance d is an expression, the size relationship between d and Dmax
is unclear. Hence, by using the value of the distance d set in step
3218, the result of the comparison (d≤Dmax?) undertaken by decision
step 3220 in FIGS. 9A and 9B becomes false. As such, the memory
references of statement Sj are not mapped into the same cache block
as are the memory references in cache reference group CAG(m).

This concludes my explanation of cache reference group analysis step
32. With the above described processing, cache reference group
information 9900, encompassing all the cache reference groups, is
generated.

Next, I will explain, in detail, the processing inherent in cache
conflict graph analysis step 34 shown in FIG. 8.

FIG. 11 shows a detailed flowchart of cache conflict graph analysis
step 34. This step generates cache conflict graph 9500 based on cache
reference group information 9900 (both of which are shown in FIG.
17).

As shown in FIG. 11, upon entry into step 34, step 3402 is executed
to generate the nodes of cache conflict graph 9500. A node is made
for each cache reference group; the cache reference group number,
such as number 9020, of each cache reference group, such as group
9010, is also added by this step.

Thereafter, through repetitive execution of steps 3404 to 3418, the
edges (lines) between these newly created nodes are established.
These edges are set between the cache reference groups, e.g., group
9010, that have been determined, in the manner described above, as
generating cache conflict. In particular, steps 3404 to 3418 are
repeated for each value of i, incrementing by one from 1 to "the
number of cache reference groups". Steps 3406 to 3416 are also
repeated, but for each value of j, incrementing by one from i+1 to
"the number of cache reference groups".

Specifically, through execution of step 3408, distance dc within
cache 110 (see FIG. 29) and between cache reference groups CAG(i) and
CAG(j) is determined. The method for obtaining distance dc will be
explained, in detail, below with reference to FIG. 12.

Thereafter, decision step 3410 shown in FIG. 11, when executed,
determines whether distance dc is less than or equal to designated
value DCmax. If this inequality is satisfied, decision step 3410
routes execution, via its true path, onward to decision step 3412.
Alternatively, if distance dc exceeds value DCmax, then steps 3412
and 3414 are skipped, with decision block 3410 routing execution, via
its false path, to step 3416.

Designated value DCmax is a threshold value for determining whether
or not cache conflict is generated based on






the distance between memory references in cache. Suitable 
values for DCmax are from 0 to "cache block length". 
Hereafter, I assume DCmax to have the value 4. However, 
the threshold DCmax value is not limited to those given 
above and can be any value which determines whether or not 
cache conflict is generated. Hence, in this preferred 
embodiment, as described above, a determination is made 
whether or not memory blocks of memory references in 
cache reference group CAG(i) and memory references in 
cache reference group CAG(j) are mapped to the same cache 
block, in other words, whether or not these memory refer- 
ences cause cache conflict with each other. 

Another method that can be alternatively used to deter- 
mine whether or not memory blocks are mapped to the same 
cache block involves directly obtaining cache blocks 
mapped from the position of the reference array elements in 
main memory. However, in this case, the array element 
addresses change with loop iterations. Therefore, it is impor- 
tant to take into consideration that mapped cache blocks also 
change, which, in turn, complicates attendant processing. 
Nevertheless, this alternate method can be used in the 
preferred embodiment of my invention. 

Now, continuing with the discussion of step 34 shown in 
FIG. 11, decision step 3412, when executed, determines 
whether an edge has been established, i.e., already exists, 
between nodes i and j in cache conflict graph 9500. If such 
an edge does not yet exist in the graph, then decision step 
3412 routes execution, via its false path, to step 3414, which, 
in turn, sets (places) an edge in the graph between nodes i 
and j. Execution then loops back, if appropriate, as noted 
above, via steps 3416 and 3406, to step 3408, and so on. 
Alternatively, if this particular edge already exists in the 
graph, then decision step 3412 routes execution, via its true 
path, directly to step 3416 from which execution loops back, 
if appropriate and as noted above, to step 3408, and so on; 
else execution of step 34 terminates at step 3420. 
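
The node and edge construction of FIG. 11 described above can be sketched as follows. This is only a minimal illustration, not the flowchart itself: the function name `build_conflict_graph`, the callable `cache_distance` (standing in for step 3408) and the use of `None` to represent a non-constant distance are all assumptions made here. By analogy with the treatment of distance d, a non-constant distance is taken to fail the comparison against DCmax.

```python
def build_conflict_graph(num_groups, cache_distance, dc_max=4):
    """Sketch of steps 3402-3420: one node per cache reference group,
    one edge per pair of groups whose distance on cache is <= dc_max.

    cache_distance(i, j) is a hypothetical stand-in for step 3408; it
    returns an int, or None when the distance is not a constant.
    """
    nodes = list(range(1, num_groups + 1))         # step 3402
    edges = set()
    for i in nodes:                                # steps 3404-3418
        for j in range(i + 1, num_groups + 1):     # steps 3406-3416
            dc = cache_distance(i, j)              # step 3408
            if dc is None or dc > dc_max:          # step 3410, false path
                continue
            if (i, j) not in edges:                # step 3412
                edges.add((i, j))                  # step 3414
    return nodes, edges


# Hypothetical distances for three groups (not taken from FIG. 18).
distances = {(1, 2): 0, (1, 3): 100, (2, 3): None}
nodes, edges = build_conflict_graph(3, lambda i, j: distances[(i, j)])
print(nodes, edges)
```

Storing edges in a set makes the "edge already exists" test of step 3412 redundant in practice; it is kept only to mirror the flowchart.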

Next, I will explain the processing inherent in step 3408
which, as described above, determines distance dc on cache
110 between cache reference groups CAG(i) and CAG(j).

FIG. 12 shows a detailed flowchart for step 3408. Upon 
entry into step 3408, execution proceeds to step 3452 which, 
when executed, sets an initial value for distance dc as 
equaling DCmax+1. Steps 3454 to 3466 are then repetitively 
executed for each and every memory reference Ai registered 
to cache reference group CAG(i). 

Through execution of step 3456, distance diffc between 
memory reference Ai and cache reference group CAG(j) is 
obtained. Since virtually the same processing is used in this 
step as is used to implement step 3218, explained above in 
conjunction with FIG. 10, for the sake of brevity, I will omit 
any further details of this processing. After step 3456 has 
fully executed to determine distance diffc, step 3458 is 
executed to determine whether distance diffc is a constant. 
If this distance is a constant, then decision step 3458 routes 
execution, via its true path, onward to step 3460. 
Alternatively, if distance diffc is not a constant, then decision 
step 3458 routes execution, via its false path, to step 3470. 

Through execution of step 3460, a modulus, i.e., remainder
of a quotient, of distance diffc divided by the cache size
is calculated, with the distance diffc then being set to a value
of the remainder. This calculation provides the distance on 
cache 110. Next, decision step 3462 is executed to determine 
whether an absolute value of distance diffc is less than 
distance dc. If the absolute value is less than the current 
value of distance dc, then decision step 3462 routes 
execution, via its true path to step 3464, which sets distance
dc equal to this absolute value. Alternatively, if decision step
3462 determines that the absolute value of distance diffc 
exceeds or equals distance dc, then step 3464 is skipped with 
execution proceeding, via the false path emanating from this 
decision block, to step 3466. Step 3470 when executed, sets 
distance dc equal to the expression for distance diffc. 
Thereafter, execution loops back, if appropriate and as noted 
above, via steps 3466 and 3454, to step 3456, and so on; else, 
execution of step 3408 terminates at step 3468. 
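
For the case in which every address is a known constant, the distance calculation of FIG. 12 can be sketched as below. This is a simplified illustration under stated assumptions: addresses are plain byte offsets rather than symbolic expressions, so the non-constant branch (step 3470) is not modelled; the helper names `memory_distance` and `cache_distance` are hypothetical; and the per-reference distance of step 3456 is taken as the uncapped minimum absolute difference.

```python
def memory_distance(addr, group):
    """Smallest absolute byte distance from addr to any address in group
    (constant-address analogue of step 3456)."""
    return min(abs(addr - other) for other in group)


def cache_distance(group_i, group_j, cache_size, dc_max=4):
    """Sketch of steps 3452-3468 for constant addresses."""
    dc = dc_max + 1                               # step 3452
    for addr in group_i:                          # steps 3454-3466
        diffc = memory_distance(addr, group_j)    # step 3456
        diffc = diffc % cache_size                # step 3460: distance on cache
        if diffc < dc:                            # step 3462
            dc = diffc                            # step 3464
    return dc                                     # step 3468


CACHE_SIZE = 256 * 1024
# Two references exactly one cache size apart map to the same cache block.
print(cache_distance([0], [CACHE_SIZE], CACHE_SIZE))
# References 1000 bytes apart leave dc at its initial value dc_max + 1.
print(cache_distance([0], [1000], CACHE_SIZE))
```

Because `memory_distance` already returns a non-negative value, taking the remainder with `%` and then comparing is equivalent here to taking the absolute value of the remainder, as the text describes.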

Once step 3468 is reached, execution of cache conflict 
graph analysis step 34 ends, with cache conflict graph 9500 
being generated. Hence, as explained above in conjunction 
with FIG. 8, at this point in the processing, decision step 36 
is executed to determine whether an edge, hence indicative 
of a cache conflict, exists in cache conflict graph 9500. 

FIG. 18 shows the contents of cache conflict information 
9 after cache conflict analysis process 30 (as shown in FIGS. 
8 to 12 and as described above) is performed on intermediate 
code 5 (discussed above and shown in FIG. 16) of source 
program 3 shown in FIG. 2. As shown in FIG. 18, cache 
conflict graph 9500 contains a graph consisting of nodes 
951, 952 and 955, and a graph consisting of nodes 953 and 
954. Cache reference group information 9900 includes mul- 
tiple cache reference groups 9010, 9110, 9210, 9310, and 
9410, with corresponding cache reference group numbers 
9020 to 9024 having been added to the respective cache 
reference groups. 

From cache conflict graph 9500 shown in FIG. 18, cache 
conflict occurs between cache reference groups 9010, 9110, 
and 9410 with the cache reference group numbers 1, 2 and 
5, respectively. Also, this graph indicates that cache conflict 
occurs between cache reference groups 9210 and 9310 with 
the cache reference group numbers 3 and 4, respectively. 

Cache reference group 9010 is made from two memory
references, A(I,J) and A(I+1,J); each of the other reference
groups 9110, 9210, 9310, 9410 consists of only one memory
reference. The reason that A(I,J) and A(I+1,J) are both in the
same cache reference group 9010 is that, for cache reference
group analysis step 32 (shown in FIG. 8), the difference
between the address calculation expressions for these
memory references as given in FIG. 18, namely, "(the
starting address of A)+4*(I-1+256*(J-1))" and "(the starting
address of A)+4*(I+1-1+256*(J-1))", is -4. Here, the
absolute value is less than or equal to Dmax (in this case, as
noted earlier, Dmax is taken to be 4).
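
The grouping decision for A(I,J) and A(I+1,J) can be checked numerically. The sketch below assumes 4-byte elements and a leading dimension of 256, as implied by the address calculation expressions quoted above. The helper names `addr_a` and `within_same_block` and the base address of 0 are illustrative assumptions, and the loop is a constant-address analogue of steps 3254 to 3268 followed by the comparison of step 3220.

```python
D_MAX = 4  # threshold used in the preferred embodiment


def addr_a(base, i, j):
    """Byte address of A(i, j): base + 4*(i-1 + 256*(j-1))."""
    return base + 4 * (i - 1 + 256 * (j - 1))


def within_same_block(addr, group_addrs, d_max=D_MAX):
    """Constant-address analogue of steps 3254-3268 plus step 3220."""
    d = d_max + 1                      # step 3254
    for other in group_addrs:          # steps 3256-3268
        diff = other - addr            # step 3260
        if abs(diff) < d:              # step 3264
            d = abs(diff)              # step 3266
    return d <= d_max                  # step 3220


base = 0  # arbitrary starting address of A (assumption)
for i, j in [(1, 1), (7, 3), (255, 128)]:
    diff = addr_a(base, i, j) - addr_a(base, i + 1, j)
    print(i, j, diff, within_same_block(addr_a(base, i + 1, j), [addr_a(base, i, j)]))
```

The difference is -4 for every (I, J), which is why the two references fall in one group, whereas A(I+2,J) would lie 8 bytes away and fail the test.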

The reason that an edge exists between nodes 952 and 951
in cache conflict graph 9500 is as follows. In cache conflict
graph analysis step 34 (shown in FIG. 8), the difference
between respective address calculation expressions, given in
FIG. 18, "(the starting address of B)+4*(I-1+256*(J-1)+
256*128*(2-1))" and "(the starting address of A)+4*(I-1+
256*(J-1))" of memory references B(I,J,2) and A(I,J),
respectively, in cache reference groups 9210 and 9010
corresponding to the above node numbers is "4*256*128+
(the starting address of B - the starting address of A)". In
comparison, as shown in FIG. 3, for contiguous storage, the
starting address for array B is situated 128 Kbytes after the
starting address of array A, so the above mentioned difference
becomes 256 Kbytes, the remainder after division by
cache size (256 Kbytes) is 0, and the absolute value (0 in this
case) is clearly less than DCmax (DCmax=4 in this case).
Inasmuch as the remaining edges in FIG. 18 are derived in
a similar manner to that for the edge between nodes
951 and 952, for brevity I will omit any such discussion for
all the remaining edges in cache conflict graph 9500. Hence,
I will not explain other portions of FIG. 18.
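
The arithmetic behind this edge can be verified with a short script. It assumes the address calculation expressions quoted above, a starting address of 0 for array A (an arbitrary choice), array B placed 128 Kbytes after A, and a 256 Kbyte cache; the function names `addr_a` and `addr_b` are illustrative.

```python
KB = 1024
CACHE_SIZE = 256 * KB
DC_MAX = 4


def addr_a(base, i, j):
    """Byte address of A(i, j): base + 4*(i-1 + 256*(j-1))."""
    return base + 4 * (i - 1 + 256 * (j - 1))


def addr_b(base, i, j, k):
    """Byte address of B(i, j, k): base + 4*(i-1 + 256*(j-1) + 256*128*(k-1))."""
    return base + 4 * (i - 1 + 256 * (j - 1) + 256 * 128 * (k - 1))


base_a = 0                  # assumption: any starting address gives the same difference
base_b = base_a + 128 * KB  # B starts 128 Kbytes after A (contiguous storage)

for i, j in [(1, 1), (5, 7), (256, 128)]:
    diff = addr_b(base_b, i, j, 2) - addr_a(base_a, i, j)
    on_cache = abs(diff) % CACHE_SIZE
    print(i, j, diff, on_cache, on_cache <= DC_MAX)
```

For every (I, J) the difference is 4*256*128 + 128 Kbytes = 256 Kbytes, so the distance on cache is 0 and the two references compete for the same cache block.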




As described above, after the cache conflict information is
generated, through execution of cache conflict analysis process 30,
and cache conflicts have been verified as existing through execution
of decision step 40 shown in FIG. 1, memory reference reordering
process 50 is then performed. Having fully explained process 30 and
decision step 40, I will now explain, in detail, reference reordering
process 50.

Through execution of process 50, cache conflicts are reduced by
changing the order of memory references so that, after re-ordering,
all references to a memory block, previously transferred to a cache
block, are completed before other memory references to that memory
block occur; thus preventing the memory block from being rejected
from cache 110.

FIG. 13 is a high-level flowchart which provides an overview of the
processing inherent in memory reference reordering process 50. Upon
entry into process 50, execution proceeds to step 51, which, when
executed, writes cache reference group numbers, e.g., number 9020,
for intermediate code 5 into reference group numbers 330 (see FIG.
19) of memory reference statements but only for those cache reference
groups, e.g., group 9010, which cause cache conflicts according to
cache conflict information 9. For example, based on the cache
conflict information 9 shown in FIG. 18, once intermediate code 5,
shown in FIG. 16, has been processed to include the corresponding
cache reference group numbers as reference group numbers 330, the
resulting intermediate code is shown in FIG. 19.

Returning to FIG. 13, once step 51 has fully executed to fully write
all the corresponding cache reference group numbers, step 52 is
executed. Step 52, based on cache conflict information 9, determines
a particular loop unrolling method which will be performed in a next
step, i.e., step 53. Loop unrolling step 53 is performed as part of
memory reference reordering process 50 because I anticipate that
memory references to the same memory block will occur over multiple
loop iterations. Through use of loop unrolling step 53, multiple
changes in the order of memory references can be undertaken over
multiple loop iterations, so that overall cache reduction process 20,
of which reordering process 50 is a part, maximally reduces cache
conflicts. Hence, for optimal cache conflict reduction, step 52 is
used to determine the optimal loop unrolling method. I will later
explain, in detail, how step 52 determines the proper loop unrolling
method.

For example, based on cache conflict information 9 shown in FIG. 18,
and for intermediate code 5 of FIG. 19, optimal loop unrolling
involves unrolling the innermost loop of I four times. In that
regard, step 52 determines the parameters of unrolling, i.e., which
loop to unroll and the number of times the loop should be unrolled.

Once the optimal loop unrolling method has been determined, execution
proceeds to step 53, which, when executed, actually unrolls the loop
according to these loop parameters. Once the pertinent parameters,
such as those set forth in the example immediately above, are
determined by step 52, the actual process itself of unrolling a loop
is well known. Hence, I will not explain loop unrolling in any
further detail.

After step 53 fully is executed, step 54 is executed to process
intermediate code 5 which was generated by the loop unrolling step
53. Specifically, through execution of step 54, multiple assignment
statements having the same target variable(s) in this code are
detected, and if it is possible to rename the variable(s), the
variable(s) is renamed. Through step 54, other than when necessary,
the target variables of assignment statements are changed so as to be
different from each other. Doing so increases the freedom to change
the order of the intermediate code, and allows for more optimal
reordering to be subsequently undertaken. Conversely, if an
assignment statement for the same variable is unchanged, i.e., left
as is, these assignment statements can not be reordered with respect
to each other, thereby increasing the difficulty through which the
intermediate code can be reordered. Since the variable renaming
process performed by step 54 is also well known, I will not explain
it in any further detail here.

For example, if the loop unrolling step 53 and variable renaming step
54 are executed on intermediate code 5 shown in FIG. 19, resulting
intermediate code 5 as shown in FIGS. 20A and 20B is obtained; for
which the correct orientation of the drawing sheets for these figures
is shown in FIG. 20. As explained above, given the determination of
the loop unrolling method, i.e., the pertinent parameters therefor,
determined in step 52 of, illustratively, unrolling the innermost
loop of I four times, then, as shown in FIGS. 20A and 20B, the
innermost loop has been unrolled by 4 times for I. Where possible,
variables, existing in code 5 shown in FIG. 19, have been renamed in
the code shown in FIGS. 20A and 20B.

After variable renaming is completed, execution proceeds to step 55
as shown in FIG. 13, which, when executed, eliminates references to
the same array elements in intermediate code 5 as this code then
exists. As a result, redundant memory references are eliminated.
Since the elimination process undertaken by step 55 is also widely
known, for brevity I will omit any further explanation of it.

For example, when the elimination of common array elements is
performed through execution of step 55 on intermediate code 5 shown
in FIGS. 20A and 20B, intermediate code 5 such as that shown in FIGS.
21A and 21B is obtained; the correct orientation of the drawing
sheets for FIGS. 21A and 21B is shown in FIG. 21. For the
illustrative intermediate code shown in FIGS. 20A and 20B, statements
11, 21 and 31 as identified by statement number 305, have all been
eliminated thereby yielding intermediate code 5 shown in FIGS. 21A
and 21B. Whenever loop unrolling is performed, several redundant
array references are generated as shown in FIGS. 20A and 20B, thereby
necessitating the elimination of these references through step 55.

Finally, as shown in FIG. 13, step 57 is executed. Based on reference
group numbers 330 written to cache conflict information 9 and as well
to intermediate code 5, the intermediate code, that resulted from
execution of step 55, such as that shown in FIGS. 21A and 21B, is
reordered through step 57 to reduce cache conflicts. Specifically,
the order of memory references is changed so that, after re-ordering,
all references to a memory block, previously transferred to a cache
block, are completed before other memory references occur to that
memory block; thus preventing the memory block from being rejected
from cache 110.

I have now explained, in an overview summary fashion, the processing
inherent in memory reference reordering process 50.

Though I have shown and described the best implementation for the
preferred embodiment of my invention, I realize that memory reference
reordering process 50 includes many processing steps. Nevertheless,
some degree of cache conflict reduction could be achieved, if in lieu
of using my entire implementation of process 50, i.e., using loop
unrolling method determination step 52 and loop




unrolling step 53, variable renaming step 54 and common and any other cache reference group, e.g., group 9010. If 

memory array element elimination step 55, only one or a such a cache conflict exists, then decision step 524 routes 

sub-set of these steps were to be used instead. execution, via its true path, to step 526. Alternatively, if no 

For example, if source program 3 were to be provided such cache conflict exists, then steps 526 through 532 are 

with its innermost loop having been sufficiently unrolled, 5 skipped, with decision step 524 routing execution, via its 

then acceptable performance would result even if either loop false path, to repeat step 534, to assess cache conflict for the 

unrolling method determination step 52 or loop unrolling next cache reference group CAG(i+1), or if all such groups 

step 53 was not performed. Conversely, for source programs have been processed, proceed to step 536. 

which are provided with absolutely no loop unrolling, then Through execution of step 526, one memory reference in 

these steps are essential and must be performed. Hence, as 10 cache reference group CAG(i) is selected, with the address 

will be plainly evident to one skilled in the art, many such calculation expression 9090 (see FIG. 17) therefor being 

partial embodiments of my inventive method can be devised assigned to variable AE. Next, step 528 is executed to 

and implemented, but for brevity I will not describe them in determine the unrolling loop candidate entry 84. 

any further detail here. Specifically, for the possible loops to be unrolled, the stride 

Inasmuch as intermediate code 5 experiences a variety of 15 for each loop is determined. For any loop, the stride repre- 

successive changes as the code progresses from loop unroll- sents a distance, in memory addresses and in absolute value 

ing step 52 through common array element elimination step terms, between a memory reference in one iteration of the 

55 as shown, e.g., in FIG. 13, then, prior to realigning the loop, i.e., at address AE(ii), and the same reference in the 

intermediate code in step 57, one can alternatively next iteration of that loop, i.e., at address AE(ii+inc), with 

re-execute cache conflict analysis process 30 (see FIG. 1), 20 variable inc defining an increment of the loop from loop 

including step 51 (see FIG. 13) in which cache reference iteration ii. The loop having the lowest stride for the one 

group numbers, e.g., number 9010, are written to interme- memory reference in CAG(i) is selected as a candidate loop 

diate code 5, just prior to executing step 57. Since the to be unrolled for cache reference group CAG(i). 

effectiveness of cache conflict reduction process 20 does not Next, step 530, which when executed, selects the candi- 

markedly change, i.e., increase by much, I see no need to 25 date loop unrolling count, i.e., the number of times a loop is 

explain this alternate technique in any further detail. to be unrolled, for the candidate loop selected through step 

Now, returning to memory reference reordering process 528 for cache reference group CAG(i). Specifically, the 

50 shown in FIG. 13, 1 will now describe, in detail, both loop candidate unrolling count is selected, for the candidate loop, 

unrolling method determination step 52 and intermediate as a maximum value of all unrolling counts for that loop 

code reordering step 57. First, I will explain the loop 30 such that, after this loop is unrolled, all the memory refer- 

unrolling method determination step 52. ences in CAG(i) are included within one memory block. 

FIG. 14 is a detailed flowchart showing the loop unrolling Thereafter, step 532 is executed to write an entry for cache 
method determination step 52 shown in FIG. 13. reference group CAG(i) into table 8 (see FIG. 22B) con- 
First, upon entry into step 52 as shown in FIG. 14, step „ Gaining a cache reference group number 82, and, for this 
521 is executed which, using cache conflict information 9, group, the loop to be unrolled for this group, i.e., (optimal) 
generates loop unrolling candidate table 8. This table, when unrolling loop candidate 84, and loop unrolling count 86 
fabricated, contains all loop unrolling candidates from therefor. 

which the loop(s) that is ultimately unrolled is selected. FIG. 22B shows the loop unrolling candidate table 8 

FIG. 22A shows the structure of loop unrolling candidate 40 generated through the above-described procedure for cache 

table 8. In this table, for each separate cache reference conflict information 9 shown in FIG. 18. The particular loop 

group, such as group 9010 shown in FIG. 18, that causes a "DO 20 1=1, 255" appears most frequently as unrolling loop 

cache conflict, the table maintains a separate entry contain- candidate 84, with the largest loop unrolling count candidate 

ing a cache reference group number, 82 shown in FIG. 22A, 86 for this loop being "4". Once the loop processing for all 

a suitable unrolling loop candidate 84, and a loop unrolling 45 the cache reference groups is completed and table 8 is 

count candidate 86. After loop unrolling candidate table 8 is fabricated as shown in FIG. 22B, then, given the illustrative 

generated through execution of step 521, thereafter step 536, contents of table 8, through execution of steps 536 and 538 

as shown in FIG. 14, is executed to extract the entry for the shown in FIG. 14, this particular loop is selected as the loop 

candidate loop to be unrolled which appears most often in to be unrolled, with its corresponding loop unrolling count 

this table. The loop specified in this entry becomes the loop 50 stored in table 8 specifying the number of times, i.e., the 

that is selected to be unrolled. Thereafter, step 538 is unrolling count, this loop is to be unrolled, 

executed to extract the largest value of the unrolling count Next, I will describe, in detail, as part of cache conflict 

candidate 86 entries stored for the selected loop; this largest reduction process 50 (see FIG. 1), reordering of intermediate 

value becomes the loop unrolling count, i.e., the number of code step 57 (hereinafter, for simplicity, referred to as 

times this selected loop is to be unrolled. 5S "intermediate code reordering" step 57) shown in FIG. 13. 

From the above, the optimal loop unrolling method, i.e., FIG. 15 is a detailed flowchart of intermediate code 

the loop unrolling parameters, is determined for cache reordering step 57. The operations undertaken by step 57 are 

conflict reduction process 20. based on "list rescheduling", which is a conventional pro- 

With the above in mind, I will now proceed to explain, in cess of generally reordering instructions to increase overall 

detail, loop unrolling candidate table generation step 521. eo program execution speed. List rescheduling is discussed in 

During execution of loop unrolling candidate table genera- references, such as ACM SIGPLAN Notices, Vol. 25, No. 7, 

tion step 521, steps 522 to 534 are separately repeated for July 1990, pages 97-106. 

each different cache reference group CAG(i). Clearly, other methods of reordering instructions other 

Specifically, upon entry into step 521, execution proceeds, than that specifically used in the preferred embodiment can 

via repeat step 522, to decision step 524. Using cache 65 be employed. However, to reduce cache conflict for a given 

conflict information 9, decision step 524 determines whether memory block, the order of memory references must be 

cache conflict exists between cache reference group CAG(i) adjusted before all referencing, to memory blocks previ- 



01/29/2004, EAST version: 1.4.1 



5,8( 

21 

ously transferred, to a cache block would have been 
complete, so that the given memory block will not be 
rejected from cache. To do this, in this preferred 
embodiment, selection processing has been added midway 
through the processing of step 57 in form of scheduling 
instructions — which, as described below, are provided by 
step 584. Furthermore, steps 582, 586 and 590 are used, as 
described below, to ensure proper operation in those 
instances where memory reference reordering can not be 
accomplished. 
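
Returning briefly to loop unrolling method determination step 52, the following is a minimal sketch of the selection logic of steps 528, 530, 536 and 538, not the implementation of the preferred embodiment. The function names, the loop label "DO 10 J", the 32-byte block size and the array layout are all assumptions made for illustration, and the unrolling count is approximated as the block size divided by the stride, which presumes block-aligned data.

```python
from collections import Counter


def stride(address_of, ii, inc):
    """Absolute address distance between one reference in iteration ii
    and the same reference in the next iteration, ii + inc."""
    return abs(address_of(ii + inc) - address_of(ii))


def candidate_for_group(loops, block_size, ii=1):
    """Steps 528 and 530 (sketch). `loops` maps a loop label to a pair
    (address_of, inc). Returns (loop label, candidate unrolling count)."""
    label = min(loops, key=lambda name: stride(loops[name][0], ii, loops[name][1]))
    s = stride(loops[label][0], ii, loops[label][1])
    # Assumption: block-aligned data, so block_size // stride references
    # of the unrolled loop fall within one memory block.
    count = max(1, block_size // s) if s else 1
    return label, count


def select_loop(table):
    """Steps 536 and 538 (sketch). `table` holds (group, loop, count) rows:
    pick the loop that appears most often, then its largest count."""
    loop, _ = Counter(row[1] for row in table).most_common(1)[0]
    return loop, max(row[2] for row in table if row[1] == loop)


# Hypothetical array A(256, 256) of 8-byte elements, stored column-major.
loops = {
    "DO 20 I": (lambda i: 8 * (i - 1), 1),        # stride of 8 bytes
    "DO 10 J": (lambda j: 8 * 256 * (j - 1), 1),  # stride of 2048 bytes
}
print(candidate_for_group(loops, block_size=32))  # ('DO 20 I', 4)

# Hypothetical contents for a candidate table shaped like table 8.
table = [(1, "DO 20 I", 4), (2, "DO 20 I", 2), (3, "DO 10 J", 1)]
print(select_loop(table))                         # ('DO 20 I', 4)
```

With these illustrative inputs, the loop with the smallest stride is also the one that appears most often in the table, so both stages agree on the same loop and a count of 4.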

In particular, intermediate code reordering step 57 is 
formed of two major steps. 

First, through execution of step 570 shown in FIG. 15, 
each statement of intermediate code 5 is allocated to sched- 
ule table 26. Schedule table 26 shows execution statements 
for each time slot of program execution; this table has the 
format shown in FIGS. 24A and 24B; the correct orientation
of the drawing sheets for these figures being shown in FIG. 
24. Each line in this table is composed of time slot value 303, 
statement number 305, three address statement 320 and 
reference group number 330. When multiple statements are 
executed in one time slot, schedule 26 is constructed of 
columns 305, 320 and 330 repeated the appropriate number 
of times equal to the number of statements so executed in 
that slot. 

Second, after step 570 has allocated each statement of 
intermediate code 5 to schedule table 26, step 598 is 
executed to reorder intermediate code 5 according to the 
order within schedule table 26. 

Given that overview of step 57, I will explain, in detail,
the processing inherent in step 570 to allocate each state- 
ment of intermediate code 5 to the schedule table. 

In particular, upon entry into step 570, execution proceeds 
to step 572 which, when executed, generates data depen- 
dency graph 24 for intermediate code 5 as it then exists. FIG. 
23 shows illustrative data dependence graph 24 for
intermediate code 5 shown in FIGS. 21A and 21B. This particular
graph depicts the execution order inter-relationships for the 
statements in intermediate code 5. Once this graph is con- 
structed through step 572, step 574, shown in FIG. 15, is 
executed to prioritize, through a critical path method, the 
nodes in data graph 24. 

Graph 24 shows the execution order inter-relationships 
between each statement of intermediate code 5 represented 
as a node, and an arc(s) connecting each such node to those 
which need to precede or follow that node, i.e., the corre- 
sponding statements, in the execution order. Additional data 
has been added to each node and arc. For example, for any 
statement represented by a node, a number on the left side 
in a field of a node, e.g., node 242, is statement number 305 
for the statement corresponding to that node, and the state- 
ment on the right side is the corresponding three address 
statement 320. The numbers added to the right of each node 
are the priority and reference group number 330, when the 
statement is executed. However, reference group number 
330 is only added to statements that include memory refer- 
ences that cause cache conflict. 

As an example, arc 244 represents that the statement of 
node 242 must be executed before the statement of node 246 
is executed. The numbers added to the arc show the duration 
of execution of the preceding statement, such as "2" in the 
case of the arc connecting these two nodes. Priority through 
the critical path method is determined by a duration of the 
longest path from each node to a final node. 
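
As a minimal sketch of the critical path prioritization of step 574, the priority of each node can be computed as the longest total duration along any path from that node to a final node. The node names and durations below are hypothetical, and the sketch assumes that each arc carries the duration of the preceding statement and that a final node therefore has priority 0.

```python
from functools import lru_cache


def critical_path_priorities(arcs):
    """Return {node: longest total duration from node to a final node}.

    `arcs` maps a node to a list of (successor, duration) pairs, where
    duration is the execution time of the preceding statement.
    """
    nodes = set(arcs) | {succ for out in arcs.values() for succ, _ in out}

    @lru_cache(maxsize=None)
    def longest(node):
        # A node with no outgoing arcs is a final node (priority 0).
        return max((d + longest(s) for s, d in arcs.get(node, ())), default=0)

    return {node: longest(node) for node in nodes}


# Hypothetical graph: s1 -> s2 -> s4 and s3 -> s4.
arcs = {"s1": [("s2", 2)], "s2": [("s4", 1)], "s3": [("s4", 3)]}
priorities = critical_path_priorities(arcs)
print(sorted(priorities.items()))
# [('s1', 3), ('s2', 1), ('s3', 3), ('s4', 0)]
```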

After step 574, shown in FIG. 15, has fully executed to
determine priority for all the arcs in the graph, step 576 is
executed to initialize variable t, which specifies the current
time slot, to 1. Thereafter, step 578 is executed, which, for
statements that have not yet been allocated to the schedule
table, detects a set of memory reference statements (SRs)
which can be executed in time slot t. However, a statement
that is able to execute at time slot t means that all immediately
preceding statements, having been allocated to the
schedule table, must have completed their execution before
time slot t.

Next, execution proceeds to decision step 580. This step
determines whether or not the set of SRs produced by step
578 is empty. If this set of SRs is empty, i.e., no statements
are seen as being executable in time slot t, then decision step
580 routes execution, via its true path, to decision step 592.
Alternatively, if this set of SRs is not empty, decision step
580 routes execution, via its false path, to step 582 which,
in turn, saves the current contents of this set to set SRorig.

Once step 582 has executed, execution proceeds to step
584. The latter step removes from the set of SRs the
intermediate code for memory reference statements that
belong to cache reference groups which conflict with other
cache reference groups and for which only a portion of those
statements is allocated to the schedule table. Through step
584, the order of memory references is controlled, i.e.,
specifically re-ordered, to reduce cache conflict before all
referencing to a memory block previously transferred to a
cache block would have been completed, so that the memory
block is not rejected from cache.

Specifically, through step 584, allocation of the memory
reference statements, that have been removed from the set of
SRs, to the schedule table has not been stopped, but rather,
just delayed. In particular, each memory reference statement
that has been removed from the set of SRs is allocated to the
schedule table after all other statements in the cache reference
groups have been allocated, the latter statements
including statements which would cause cache conflict with
the removed statement and those statements in the cache
group that have been allocated to the scheduling table.

After step 584 completes its execution, execution proceeds
to step 586 which determines whether or not the set of
SRs is empty. If the set of SRs is empty, i.e., all its statements
have been allocated to the schedule table, then decision step
586 routes execution, via its true path, to step 590.
Otherwise, if the set of SRs is not empty, then decision step
586 routes execution, via its false path, to step 588. Thus,
one can see that step 586 effectively determines whether
there are no statements which can be executed at time slot
t other than memory reference statements which cause cache
conflict. In other words, when the set of SRs is empty, there
are no statements which can be executed at time slot t other
than memory references which cause cache conflict, hence
execution proceeds forward to step 590. Alternatively, when
the set of SRs is not empty, there are statements which can
be executed at time slot t which do not cause cache conflict,
thus effectively directing execution forward to step 588.

As noted above, step 590 is executed when there are no
statements which can be executed at time slot t other than
memory references which cause cache conflict. When
executed, step 590 resets the set of SRs to select a statement
to be allocated to time slot t from among the memory
reference statements which cause cache conflict. In other
words, the set of SRs is reset to the contents of set SRorig,
which is the contents of the set of SRs at the point this set
was first fabricated through execution of step 578. In this
case, cache conflict is always generated, but as described
above, the only statements which can be executed at this
time, i.e., time slot t, are memory reference statements.
Since it is rare to only have statements which cause cache
conflict, this situation poses no great problem.

Alternatively, when such conflicting memory reference 
statements potentially executable at time slot t remain, then 
one can cease statement allocation at time slot t, increase the 
time slot value, t, by 1, and continue processing downward 
from step 578 in an attempt to allocate statements, including 
the conflicting memory reference statements, at time slot 
t+1. However, with this approach, unless one were to fix, 
i.e., limit, the number of time slots which are allowed to 
remain empty, i.e., have nothing allocated to them, process- 
ing may continue indefinitely. In view of the above caveat, 
this alternate approach can certainly be used within my 
inventive method, and particularly with step 570. 

In any event, step 588, when executed, allocates one of 
the highest priority statements in the set of SRs to the time 
slot t portion of the schedule table. Once this occurs, 
execution loops back to step 578. In particular, when step 
588 is executed immediately after step 586, then of the 
statements, which can be executed at time slot t, those that 
do not cause cache conflict and which have the highest 
priority are allocated to the schedule table. Alternatively, 
when step 588 is executed immediately after step 590, then 
statements which cause cache conflict can be allocated to the 
schedule table, but only those statements having the highest 
priority are selected. There are various existing conventional 
methods for selecting one of many statements that has the 
highest priority which can be used in implementing step 588. 
Since the particular method chosen is not particularly rel- 
evant to my present invention, I will omit any details of these 
methods herein. 

Eventually, decision block 580 routes execution, over its 
true path, to decision step 592. Step 592, when executed, 
determines whether there are statements which have not yet 
been allocated to the schedule table. If such statements 
remain to be allocated, then decision step 592 routes 
execution, via its true path, to step 596, which increments 
the current time slot value, t, by one. Thereafter, execution 
loops back to step 578 where processing is repeated for these 
statements. Alternatively, if all the statements in the inter- 
mediate code have been allocated, i.e., no unallocated state- 
ments remain, then processing of step 570 ends, with 
decision block 592 routing execution, via its false path, to 
point 594, and from there to step 598, which has been 
described above. 
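
The allocation loop of step 570 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure of FIG. 15 itself: the data layout, the `width` parameter (a single-issue machine is assumed, so a slot is full after one statement) and the reading of step 584 (a statement is delayed while any group that conflicts with its own group is only partly allocated) are all assumptions, and the statement names and priorities are hypothetical.

```python
def schedule(stmts, conflicts, width=1):
    """Allocate statements to time slots (sketch of steps 576 to 596).

    stmts:     id -> {"preds": [(pred_id, duration)], "priority": int,
                      "group": cache reference group number or None}
    conflicts: group -> set of groups that conflict with it
    width:     assumed number of statements that fit in one time slot
    """
    slot = {}  # statement id -> allocated time slot
    group_size = {}
    for s in stmts.values():
        if s["group"] is not None:
            group_size[s["group"]] = group_size.get(s["group"], 0) + 1

    def partially_allocated(group):
        done = sum(1 for i in slot if stmts[i]["group"] == group)
        return 0 < done < group_size.get(group, 0)

    t = 1  # step 576
    while len(slot) < len(stmts):
        in_slot = sum(1 for v in slot.values() if v == t)
        ready = [i for i, s in stmts.items()  # step 578
                 if i not in slot and in_slot < width
                 and all(p in slot and slot[p] + d <= t for p, d in s["preds"])]
        if not ready:  # steps 580, 592 and 596
            t += 1
            continue
        original = list(ready)  # step 582 (SRorig)
        ready = [i for i in ready  # step 584: delay conflicting references
                 if not any(partially_allocated(g)
                            for g in conflicts.get(stmts[i]["group"], ()))]
        if not ready:  # steps 586 and 590: fall back to SRorig
            ready = original
        best = max(ready, key=lambda i: stmts[i]["priority"])  # step 588
        slot[best] = t
    return slot


# Hypothetical example: groups 1 and 2 conflict; all statements independent.
stmts = {
    "a1": {"preds": [], "priority": 4, "group": 1},
    "b1": {"preds": [], "priority": 3, "group": 2},
    "a2": {"preds": [], "priority": 2, "group": 1},
    "b2": {"preds": [], "priority": 1, "group": 2},
}
conflicts = {1: {2}, 2: {1}}
print(schedule(stmts, conflicts))
# {'a1': 1, 'a2': 2, 'b1': 3, 'b2': 4}
```

Ordering by priority alone would interleave the two groups (a1, b1, a2, b2). In this sketch the filter keeps the references of group 1 together before any reference of group 2 is allocated.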

FIGS. 24A and 24B collectively show an illustrative
schedule table 26 to which statements have been allocated
by the above-described procedure and in accordance with
data dependency graph 24 shown in FIG. 23; the correct
orientation of the drawing sheets for FIGS. 24A and 24B is
shown in FIG. 24. As one can now see, memory reference
statements for the intermediate code and belonging to cache
conflict groups, e.g., group 9010, with cache reference group
numbers 1 (stored within cache reference group number
field 9020 shown in FIG. 18), 2 and 5 which would otherwise
cause cache conflict with each other, as illustrated in
FIGS. 21A and 21B, through processing provided by, e.g.,
step 584, are now sequentially lined up without being mixed
together. Cache conflict groups 9010 with cache reference
group numbers (9020) 3 and 4 are the same. From this,
memory blocks are prevented from being rejected by cache,
before all referencing to memory blocks previously transferred
to a cache block, would otherwise have been completed.

FIGS. 25A and 25B collectively show intermediate code
5 after it has been reordered according to schedule table 26
shown in FIGS. 24A and 24B. In other words, intermediate
code 5 shown in FIGS. 25A and 25B is intermediate code
which resulted from use of cache conflict reduction process
20, shown in FIG. 1. After cache conflict reduction process
20, code generation process 60 of compiler 1 is performed
on the intermediate code shown in FIGS. 25A and 25B to
generate object program 7, generally shown in FIG. 1 and
specifically shown collectively in FIGS. 26A and 26B; the
proper orientation of the drawing sheets for FIGS. 26A and
26B is depicted in FIG. 26.

FIG. 27 shows the cache conflict for object program 7
shown in FIGS. 26A and 26B. As explained previously, FIG.
7 shows corresponding cache conflict when my inventive
cache conflict reduction process 20 is not used on the same
source program. As is clearly evident in FIG. 27, one can see
that, in contrast to that shown in FIG. 7, there are absolutely
no cache misses caused by cache conflicts.

I will now discuss the quantitative effects of this preferred
embodiment. With computers currently available on the
market, instruction execution time normally consumes one
cycle per instruction, but when a cache miss occurs, such
execution time can consume approximately 10 such instruction
cycles. For the following discussion and for simplicity,
I will estimate execution times using these values.

First, from the cache conflict generation conditions shown
in FIG. 7, when object program 5 shown in FIG. 6 was
executed, without having previously performed my inventive
cache conflict reduction method, an execution time of 4
loop cycles is obtained. For each of 4 iterations through the
single loop, 6 cache misses are generated. Given my
assumption of 10 cycles for each cache miss, and since the
number of instructions for which a cache miss was not
generated is 7, the total instruction execution time for the
program shown in FIG. 6 is 4*(6*10 cycles+7 cycles)=268
cycles.

In contrast, from the cache conflict generation conditions
shown in FIG. 27, i.e., for object program 5 of FIGS. 26A
and 26B which results from performing my inventive cache
conflict reduction process on the same source program as
that which resulted in the object code shown in FIG. 6, given
the same number of loop cycles, i.e., 4, the total execution
time is the number of cache misses, here 6, multiplied by 10
cycles/miss plus time needed to execute the remaining
instructions which did not generate a cache miss, i.e., 33
cycles. The resulting execution time is 6*10+33=93 cycles.
Therefore, through use of my inventive cache conflict reduction
process, the execution time, for the same source code,
decreased from 268 to 93 cycles. In other words, execution
time, as measured in cycles, decreased by a factor of 93/268
which equals an increase in execution speed of approximately
2.9 times.
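
The arithmetic of the two preceding paragraphs can be checked with a short calculation. The variable names are illustrative only; the figures (10 cycles per miss, 6 misses and 7 other instructions per iteration over 4 iterations, versus 6 misses and 33 other instructions in total) are those stated above.

```python
MISS_CYCLES = 10  # assumed cost of one cache miss, in cycles

# Without cache conflict reduction: 4 iterations, each with 6 misses
# and 7 instructions that do not miss.
cycles_before = 4 * (6 * MISS_CYCLES + 7)

# With cache conflict reduction: 6 misses in total plus 33 cycles for
# the remaining instructions.
cycles_after = 6 * MISS_CYCLES + 33

speedup = cycles_before / cycles_after
print(cycles_before, cycles_after, round(speedup, 1))  # 268 93 2.9
```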

Furthermore, by using the preferred embodiment to reorder
the memory references in program intermediate code 5
through execution of memory reference reordering process
50, any extra dependency caused by allocating a few registers
to the memory reference instructions does not occur.
Hence, use of my inventive method provides increased
freedom to reorder memory references. Consequently,
memory references can be readily transferred to optimal
positions with a concomitant improvement in the amount of
cache conflict reduction that results.

Further, with my preferred embodiment, memory reference
reordering process 50 was described as being executed
on program intermediate code 5. However, process 50 can
also be executed on object program 7. In the latter case,
instruction reordering for all instructions, including those
which were not expected to be generated at the intermediate
code stage, can be performed, which will likely produce an
increasingly precise reordering.

Additionally, memory reference reordering process 50
can also be executed on both program intermediate code 5
and object program 7. Since this case simultaneously provides
all the advantages set forth in the preceding two
paragraphs, optimal cache conflict reduction can be
achieved.

Moreover, memory reference reordering process 50 can
be executed on program intermediate code 5, with the
resulting intermediate code being reverse compiled
(reconverted) to a corresponding source program, with this
source program then being externally disseminated for use
on computers other than that which generated this source
program. Hence, through this approach, other compilers can
also use the advantageous effects of cache conflict reduction
process 20 which will have been incorporated into the
source program, thereby enhancing the general use of cache
conflict reduction process 20.

I have shown and described the preferred embodiment of
my inventive method, in the context of illustrative use with
direct map method cache. However, those skilled in the art
will realize that my inventive method is not limited to only
direct map method cache, but rather can be used with other
forms of cache, such as set associative method cache.
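
To illustrate the difference between these two forms of cache, the following hedged sketch tests a set of addresses for conflict. It is not taken from the preferred embodiment: the function name, the 32-byte block size and the 256 sets are assumptions, and a direct map method cache is modeled as the special case of one block per set (`ways=1`).

```python
def has_cache_conflict(addresses, block_size, num_sets, ways=1):
    """Return True if more than `ways` distinct memory blocks map to the
    same cache set. With ways=1 this models a direct-mapped cache."""
    blocks_per_set = {}
    for address in addresses:
        block = address // block_size  # memory block number
        blocks_per_set.setdefault(block % num_sets, set()).add(block)
    return any(len(blocks) > ways for blocks in blocks_per_set.values())


# Hypothetical 8 KB cache: 256 sets of 32-byte blocks.
print(has_cache_conflict([0, 8192], 32, 256))          # True
print(has_cache_conflict([0, 8192], 32, 256, ways=2))  # False
print(has_cache_conflict([0, 8], 32, 256))             # False
```

Addresses 0 and 8192 fall in different memory blocks that map to the same set, so they conflict in the direct-mapped model but not in the two-way model. Addresses 0 and 8 share one memory block and never conflict.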

Through use of my invention, programs, which would
otherwise exhibit a large decrease in performance due to
cache conflicts, can generate fewer cache conflicts. As a
result, program execution speed can be increased. My invention
is very useful when compiling programs, operating on
computers, which have direct map method cache.

Clearly, it should now be quite evident to those skilled in
the art, that while my invention was shown and described, in
detail, in the context of a preferred embodiment, and with
various modifications thereto, a wide variety of other
modifications can be made without departing from the scope of
my inventive teachings.

I claim:

1. A compiling method for reducing cache conflicts which
would be generated during the execution of a program,
wherein the method compiles a source program of the
program into an object program for a computer having a
main memory and a data cache memory, the method
comprising steps of:

a) analyzing syntax of said source program and generating 
initial intermediate code; 

b) extracting memory references from said intermediate
code;

c) generating a cache reference group for each group of
said memory references that references a common
memory block so as to form a plurality of cache
reference groups;

d) classifying said cache reference groups and generating 
cache reference group information; 

e) based on said cache reference group information,
generating a cache conflict graph which shows the
cache conflict conditions among said cache reference
groups;

f) providing said cache reference group information and 
said cache conflict graph as said cache conflict infor- 
mation; 

g) based on said cache conflict information, changing an
execution order of memory reference codes in said
initial intermediate code, so as to form reordered
intermediate code which, when compiled into object code
and ultimately executed, will generate fewer cache
conflicts than would said initial intermediate code, if
said initial intermediate code were to be compiled into
object code and executed; and

h) generating an object program from said reordered
intermediate code.

2. The method in claim 1 wherein the step (d) comprises 
the steps of: 

determining, in response to whether a distance on said 
main memory between memory references in said 
initial intermediate code is less than a designated value, 
if said memory references refer to a common one of the 
memory blocks; and 

registering those of said memory reference, which refer to 
a common one of the memory blocks, to a common one 
of the cache reference groups. 

3. The method of claim 1 wherein the step (d) further
comprises the steps of:

obtaining the memory block which references the
positions, in said main memory, of memory references
of said intermediate code, and

registering those of said memory references, which refer
to the common memory block, to a corresponding
common cache reference group.

4. The method of claim 1 wherein the step (e) comprises 
the steps of: 

obtaining, for arbitrary first and second ones of the cache 
reference groups included in said cache reference group 
information, a minimum distance value in the cache 
memory between all of the memory references included 
in the first cache reference group and all of the memory 
references included in the second cache reference 
group; and 

determining, in response to whether said minimum dis- 
tance is less than the designated value, an occurrence of 
cache conflict between the memory references of said 
first cache reference group and the memory references 
of said second cache reference group. 

5. The method claim 1 wherein the step (e), for arbitrary 
first and second cache reference groups included in said 
cache reference group information, comprises the steps of: 

obtaining a first one of the cache blocks, said first cache 
block being mapped from positions in said main 
memory of all of the memory references that are 
included in said first cache reference group; 

obtaining a second one of the cache blocks, said second 
cache block being mapped from positions in said main 
memory of all of the memory references that are 
included in said second cache reference group; and 

determining, in response to whether ones of the memory 
reference are mapped to a same one of the cache 
blocks, existence of cache conflict between the memory 
references of said first cache reference group and the 
memory references of said second cache reference 
group. 

6. A compiling method for reducing cache conflicts which 
would be generated during the execution of a program, 
wherein the method compiles a source program of the 
program into an object program for a computer having a 
main memory and a data cache memory, the method com- 
prising steps of: 

a) analyzing syntax of said source program and generating 
initial intermediate code; 




b) detecting data cache conflicts which would be generated
by memory reference codes included in said initial
intermediate code and providing conditions for
memory reference codes to generate cache conflicts as
cache conflict information;

c) based on said cache conflict information, changing an
execution order of memory reference codes in said
initial intermediate code, so as to form reordered
intermediate code which, when compiled into object code
and ultimately executed, will generate fewer cache
conflicts than would said initial intermediate code, if
said initial intermediate code were to be compiled into
object code and executed;



d) selecting the loop from one of a plurality of loops in 
said initial intermediate codes so as to identify a 
selected loop; 

e) determining a number of times the selected loop is to 
be unrolled, such that a length of each memory refer- 
ence that results once the selected loop is unrolled 
equals a block length of one of the cache blocks, and 
unrolling the selected loop said number of times; and 

f) changing the order of said memory reference code in 
said intermediate code after the selected loop, which 
provides repetitive processing in said initial interme- 
diate code, is unrolled. 






(12) United States Patent 

Geva

US006539541B1

(10) Patent No.: US 6,539,541 B1
(45) Date of Patent: Mar. 25, 2003



(54) METHOD OF CONSTRUCTING AND 

UNROLLING SPECULATIVELY COUNTED 
LOOPS 

(75) Inventor: Robert Y. Geva, Cupertino, CA (US) 

(73) Assignee: Intel Corporation, Santa Clara, CA 
(US) 

(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/378,632

(22) Filed: Aug. 20, 1999

(51) Int. Cl.7 G06F 9/45

(52) U.S. Cl. 717/150; 717/151; 717/160;
712/241; 712/233; 712/239

(58) Field of Search 717/9, 5, 150,
717/151, 160; 712/216, 218, 233, 241,
239

(56) References Cited 

U.S. PATENT DOCUMENTS 

5,361,354 A  *  11/1994  Greyzck              717/160
5,526,499 A      6/1996  Bernstein et al.     712/216
5,537,620 A  *   7/1996  Breternitz, Jr.      717/9
5,613,117 A  *   3/1997  Davidson et al.      717/144
5,664,193 A  *   9/1997  Tirumalai            717/5
5,724,536 A      3/1998  Abramson et al.      712/216
5,778,210 A      7/1998  Henstrom et al.      712/218
5,797,013 A  *   8/1998  Mahadevan et al.     717/9
5,802,337 A      9/1998  Fielden              712/216
5,809,308 A  *   9/1998  Tirumalai            717/9
5,835,776 A  *  11/1998  Tirumalai et al.     717/9
5,842,022 A  *  11/1998  Nakahira et al.      717/9
5,854,933 A  *  12/1998  Chang                717/9
5,854,934 A  *  12/1998  Hsu et al.           717/9
5,862,384 A      1/1999  Hirai                717/9
6,035,125 A  *   3/2000  Nguyen et al.        717/9
6,145,076 A  *  11/2000  Gabzdyl et al.       712/241
6,192,515 B1 *   2/2001  Doshi et al.         717/9
6,247,173 B1 *   6/2001  Subrahmanyam         717/9
6,263,427 B1 *   7/2001  Cummins et al.       712/236
6,263,489 B1 *   7/2001  Olsen et al.         717/129
6,269,440 B1 *   7/2001  Fernando et al.      709/106
6,289,443 B1 *   9/2001  Scales et al.        708/300
6,327,704 B1 *  12/2001  Mattson, Jr. et al.  717/153
6,343,375 B1 *   1/2002  Gupta et al.         717/152
6,367,071 B1 *   4/2002  Cao et al.           717/160
6,401,196 B1 *   6/2002  Lee et al.           711/213

OTHER PUBLICATIONS 

TITLE: Improving Instruction Level Parallelism by loop 
unrolling dynamic memory disambiguation, ACM, David- 
son et al, Dec. 1995.* 

TITLE: Combining Loop Transformation considering cach- 
ing and scheduling, ACM, author: Wolf et al, 1996.* 
TITLE: Unrolling Loops with indeterminate loop counts in 
system level pipelines, Guo et al, IEEE, 1998.* 
TITLE: Symbolic range propagation, author: Blume et al, 
IEEE, 1995.* 

"Advanced Compiler Design & Implementation" by Steven 
S. Muchnick, cover, table of contents and pp. 547-569, 
Copyright 1997. 

* cited by examiner 

Primary Examiner— Kaks^i Chaki 
Assistant Examiner — Chameli C. Das 
(74) Attorney, Agent, or Firm— Peter Lam 

(57) ABSTRACT 

A method of constructing and unrolling speculatively 
counted loops. The method of the present invention first 
locates a memory load instruction within the loop body of a 
loop. An advance load instruction is inserted into the pre- 
header of the loop. The memory load instruction is replaced 
with a check instruction. The loop body is unrolled. A 
cleanup block is generated for said loop. 

32 Claims, 4 Drawing Sheets 




01/29/2004, EAST Version: 1.4.1 



U.S. Patent Mar. 25, 2003 Sheet 1 of 4 US 6,539,541 B1 




FIG. 1 






U.S. Patent Mar. 25, 2003 Sheet 2 of 4 US 6,539,541 B1 

[Drawing sheet: figure content is not recoverable from the scan.] 



U.S. Patent Mar. 25, 2003 Sheet 3 of 4 US 6,539,541 B1 

[Drawing sheet: figure content is not recoverable from the scan.] 



U.S. Patent Mar. 25, 2003 Sheet 4 of 4 US 6,539,541 B1 

FIG. 5 (flow diagram; only the box labels below are legible in the scan) 
520: Optimize As Counted Loop 
530: Optimize As "While" Loop 
535: Locate Load Of Upper Bound 
540: Insert Advance Load At Preheader 
545: Add A Cleanup Loop 



US 6,539,541 B1 



METHOD OF CONSTRUCTING AND 
UNROLLING SPECULATIVELY COUNTED 
LOOPS 

FIELD OF THE INVENTION 

This invention relates to the field of computer software 
optimization. More particularly, the present invention relates 
to a method of constructing and unrolling speculatively 
counted loops. 

BACKGROUND OF THE INVENTION 

Computer programs are generally created as source code 
using high-level languages such as C, C++, Java, 
FORTRAN, or PASCAL. However, computers are not able 
to directly understand such languages and the computer 
programs have to be translated or compiled into a machine 
language that a computer can understand. The step of 
translating or compiling source code into object code 
is performed by a compiler. Optimizations are mechanisms 
that provide the compiler with equivalent ways of 
generating code. Even though optimizations are not necessary 
in order for a compiler to generate code correctly, 
object code may be optimized to execute faster than code 
generated by a straightforward compiling algorithm if code 
improving transformations are used during code compilation. 
Loop unrolling is one such optimization that can be 
used in a compiler. 

For the purpose of loop unrolling, loops are categorized as 
follows. A loop is counted if the number of iterations that the 
loop will execute is determined once execution reaches the 
loop. Counted loops are also referred to as "for" loops. 
Conversely, a loop has a data dependent exit, loosely called a 
"while" loop, if the number of iterations is determined 
during the execution of the loop. Counted loops are further 
classified. If the compiler can determine the number of 
iterations that the loop will execute at compile time, then the 
number of iterations is a compile time constant. Otherwise, 
the number of loop iterations is variable. 

A compiler can optimize counted loops better than 
"while" loops. Applying the counted loop optimization to a 
"while" loop will cause the compiler to generate incorrect 
code. Therefore, in order to ensure the generation of correct 
code, the compiler's default assumption must be that all 
loops are "while" loops. Then the compiler may later try to 
prove that a loop is a counted loop so that more optimiza- 
tions become possible. Similarly, the compiler can optimize 
a compile time constant counted loop to execute more 
efficiently than a variable counted loop. Furthermore, apply- 
ing compile time constant loop optimizations to a variable 
counted loop will generate incorrect code. The compiler's 
default assumption has to be that all loops are variable, and 
only if the compiler succeeds in proving that a counted loop 
is a compile time constant counted loop, can the compiler 
proceed to apply further optimizations. 

For example, here are two possible optimizations that a 
compiler can apply only to compile time constant counted 
loops. In one possible optimization, the compiler may unroll 
the loop entirely. Typically, compilers will unroll a loop 
entirely if the trip count is determined to be small, e.g. eight 
or less. A second optimization that a compiler may apply to 
loops with large compile time constant trip counts is to choose 
an unrolling factor that divides the trip count evenly. If the 
loop is variable or if such an optimal factor cannot be found 
(e.g. if the trip count is a large prime number), then a cleanup 
loop must be generated after the unrolled loop to execute the 
remainder of the iterations. 
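
The choice between the two optimizations described above can be sketched in C. This is only an illustration under stated assumptions: the function name choose_unroll_factor, the FULL_UNROLL_LIMIT threshold of eight, and the max_factor parameter are hypothetical and are not taken from the patent.

```c
/* Hypothetical threshold: loops with at most this many trips are
 * unrolled entirely (the text suggests "eight or less"). */
#define FULL_UNROLL_LIMIT 8u

/*
 * Picks an unrolling factor for a loop whose trip count is a compile
 * time constant. A result equal to trip_count means "unroll entirely".
 * A result of 1 means no factor in 2..max_factor divides the trip
 * count evenly, so any unrolling would need a cleanup loop.
 */
unsigned choose_unroll_factor(unsigned trip_count, unsigned max_factor)
{
    if (trip_count <= FULL_UNROLL_LIMIT)
        return trip_count;          /* small loop: unroll completely */

    for (unsigned f = max_factor; f > 1; --f) {
        if (trip_count % f == 0)
            return f;               /* largest factor that divides evenly */
    }
    return 1;                       /* e.g. a large prime trip count */
}
```

For a trip count of 200 and a maximum factor of 10, the sketch selects 10, so the body would be copied ten times and the new loop would run twenty times.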

Compilers can also optimize counted loops to execute 
more efficiently than "while" loops. In the context of loop 
unrolling, when the compiler unrolls a "while" loop, the 
compiler simply has to copy the whole loop as many times 
as given by the unrolling factor chosen. This copy step 
includes the loop overhead. To illustrate, consider the 
following scheme: 
LOOP: 
    BODY(I) 
    I=SOME_NEW_VALUE(I) 
    IF (CONDITION(I)) 
        GOTO LOOP 
    ELSE 
        GOTO LOOP_END 
LOOP_END: 

BODY(I) is the useful part of the loop that does the real 
work in an iterative way. CONDITION is some test state- 
ment involving a variable "I" that changes in at least some 
of the loop iterations and that determines whether the loop 
terminates or continues execution. Unrolling this "while" 
loop by an unrolling factor of three yields the following 
construct: 
LOOP: 
    BODY(I) 
    I=SOME_NEW_VALUE(I) 
    IF (NOT CONDITION(I)) 
        GOTO LOOP_EXIT 
    BODY(I) 
    I=SOME_NEW_VALUE(I) 
    IF (NOT CONDITION(I)) 
        GOTO LOOP_EXIT 
    BODY(I) 
    I=SOME_NEW_VALUE(I) 
    IF (CONDITION(I)) 
        GOTO LOOP 
    ELSE 
        GOTO LOOP_EXIT 
LOOP_EXIT: 
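
As a concrete illustration of the scheme above, the following C sketch sums a zero-terminated array, first as a plain "while" loop and then unrolled by three. The function names and the zero sentinel are assumptions chosen for the example. Unlike the scheme, the C loops test the condition before the first body, so the unrolled version starts with one guard test.

```c
#include <stddef.h>

/* Original "while" loop: the trip count depends on the data. */
long sum_until_zero(const int *a)
{
    long s = 0;
    size_t i = 0;
    while (a[i] != 0) {
        s += a[i];
        i++;
    }
    return s;
}

/* The same loop unrolled by three. The loop overhead (the exit test)
 * is copied together with the body, so nothing is saved per copy. */
long sum_until_zero_unrolled3(const int *a)
{
    long s = 0;
    size_t i = 0;
    if (a[i] == 0)
        return s;                   /* guard for the empty case */
    for (;;) {
        s += a[i]; i++;
        if (a[i] == 0) break;       /* exit test after copy 1 */
        s += a[i]; i++;
        if (a[i] == 0) break;       /* exit test after copy 2 */
        s += a[i]; i++;
        if (a[i] == 0) break;       /* exit test after copy 3 */
    }
    return s;
}
```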

When a compiler unrolls a counted loop, the compiler can 
save the loop overhead. The compiler can generate loop 
overhead code only once in each new iteration that corre- 
sponds to several original iterations. Consider the following 
counted loop construct: 
I=0; 
N=some_unknown_value; 
LOOP: 
    BODY(I) 
    I=I+1 
    IF (I<N) 
        GOTO LOOP 
    ELSE 
        GOTO LOOP_EXIT 
LOOP_EXIT: 

Assume that the compiler decided to unroll this loop by an 
unrolling factor of three. The compiler has to generate code 
that will verify, at execution time, that the loop is about to 
execute at least three iterations. Also, the upper bound in the 
unrolled loop must now be reduced to N-2, and a cleanup 
loop must be generated to execute the remainder of the 
iterations. The resulting code will look like: 
I=0 
N=some_unknown_value 
IF (N<3) GOTO IN_BETWEEN 
LOOP: 
    BODY(I) 
    BODY(I+1) 
    BODY(I+2) 
    I=I+3 
    IF (I<N-2) GOTO LOOP 
    ELSE GOTO IN_BETWEEN 
IN_BETWEEN: 
    IF (I>=N) GOTO LOOP_EXIT 
CLEANUP: 
    BODY(I) 
    I=I+1 
    IF (I<N) 
        GOTO CLEANUP 
    ELSE 
        GOTO LOOP_EXIT 
LOOP_EXIT: 

If the value of 'N' is large enough, most of the execution 
time will be spent in the unrolled loop. The added control 
around the loop has a negligible effect on performance. 
Significant performance is gained from not having to 
execute the loop overhead. Hence the compiler's ability to 
prove that a given loop is counted is a key in achieving this 
performance gain. 
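
The counted loop unrolled by three with a cleanup loop can be written in C roughly as follows. This is a minimal sketch: the array sum stands in for BODY(I), and the function name is hypothetical. The guard is written as i + 3 <= n rather than i < n - 2 so that the unsigned subtraction cannot wrap around when n is less than two.

```c
#include <stddef.h>

/* Sums a[0..n-1] with the loop unrolled by a factor of three. */
long sum_counted_unrolled3(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* Unrolled loop: entered only while at least three iterations
     * remain; the exit test runs once per three bodies. */
    for (; i + 3 <= n; i += 3) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
    }

    /* Cleanup loop: executes the remaining zero, one, or two iterations. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```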

SUMMARY OF THE INVENTION 

A method of constructing and unrolling speculatively 
counted loops is described. The method of the present 
invention first locates a memory load instruction within the 
loop body of a loop. An advance load instruction is inserted 
into the preheader of the loop. The memory load instruction 
is replaced with an advanced load check instruction. The 
loop body is unrolled. A cleanup block is generated for said 
loop. 

Other features and advantages of the present invention 
will be apparent from the accompanying drawings and from 
the detailed description that follows. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example 
and not limitations in the figures of the accompanying 
drawings, in which like references indicate similar elements, 
and in which: 

FIG. 1 is a block diagram illustrating a computer system 
which may utilize the present invention; 

FIG. 2A is an example 'for' loop before loop unrolling; 

FIG. 2B shows the 'for' loop of FIG. 2A after the loop 
unrolling transformation; 

FIG. 3A is a load-store pair in a code stream; 

FIG. 3B shows the code of FIG. 3A after an advance load; 

FIG. 4A illustrates a loop before the loop unrolling 
transformation; 

FIG. 4B illustrates the loop of FIG. 4A if unrolled by a 
factor of three as a 'while' loop; 

FIG. 4C illustrates the loop of FIG. 4A unrolled three 
times as a speculatively counted loop; and 

FIG. 5 is a flow diagram that illustrates steps for 
constructing and unrolling speculatively counted loops in one 
embodiment of the present invention. 

DETAILED DESCRIPTION 

A method of constructing and unrolling speculatively 
counted loops is disclosed. Although the following embodi- 
ments are described with reference to C compilers, other 
embodiments are applicable to other types of programming 
languages that use compilers. The same techniques and 
teachings can easily be applied to other embodiments and 
other types of compiled object code. 

The state of the art in proving that a loop is a counted loop 
includes the following steps. First, identify a variable, called 
an induction variable, that changes in every loop iteration. 
The amount of change, called the stride, is usually required 
to be additive and the same for each loop iteration. Then that 
same variable has to be compared to some other value, the 
upper bound, in a manner that controls whether the loop will 
terminate or continue to execute more iterations. Also, the 
second value in the comparison must be loop invariant. The 
upper bound can either be a compile time constant or stored 
in a memory location that cannot change during the 
execution of the loop. 

Note that the trip count, i.e. the number of iterations that 
the loop executes each time, is given by "trip count=(upper 
bound-lower bound)/stride". If the upper bound, lower 
bound, and stride are all compile time constants, then so is 
the trip count. In cases where the upper bound is stored in 
a variable, the compiler has to prove that the variable cannot 
change during the execution of the loop. In order to prove 
that, the compiler has to verify that each operation in the 
loop that changes a value stored in some memory location, 
such as a store operation, is targeting a memory location that 
is different from the one used to store the value of the loop 
upper bound. The process by which the compiler determines 
whether two memory access operations refer to overlapping 
areas in memory or not is called memory disambiguation, 
and is undecidable in general. 
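
The trip count formula can be checked with a small C helper. This is a sketch under assumptions that the text does not state: the loop has the form "for (i = lower; i < upper; i += stride)" with a positive stride, and the division is rounded up so that a stride that does not divide the range evenly still counts the final partial step. The function name trip_count is hypothetical.

```c
/*
 * Number of iterations of: for (i = lower; i < upper; i += stride).
 * Returns -1 for a non-positive stride, which this sketch does not
 * handle. When stride divides (upper - lower) evenly, the result is
 * exactly (upper - lower) / stride, as in the text.
 */
long trip_count(long lower, long upper, long stride)
{
    if (stride <= 0)
        return -1;                  /* outside the assumed loop form */
    if (upper <= lower)
        return 0;                   /* loop body never executes */
    return (upper - lower + stride - 1) / stride;   /* round up */
}
```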

The enhancement disclosed here is a new way to use 
data speculative loads, also known as advanced loads. 
Advanced loads are meant to help the compiler promote the 
location of a load instruction beyond store instructions that 
are not disambiguated. The new usage of advanced loads 
described in this invention is more powerful in that it allows 
the compiler to change the way it optimizes a whole loop 
rather than simply change the location of a single load 
instruction. The present invention enables a compiler to 
optimize these loops as speculatively counted. Optimizing 
certain loops as speculatively counted may allow code 
performance almost as good as if the loops were optimized as 
counted loops and better than if the loops were optimized as 
while loops. Thus this invention may allow a compiler with 
such a capability to have a performance advantage over 
compilers that do not have this technology. As a result, it is 
important for the code optimizations to be effective. 
Therefore, a method of constructing and unrolling 
speculatively counted loops would be desirable. 

Embodiments of the present invention may be implemented 
in hardware or software, or a combination of both. 
However, embodiments of the invention may be implemented 
as computer programs executing on programmable 
systems comprising at least one processor, a data storage 
system including volatile and non-volatile memory and/or 
storage elements, at least one input device, and at least one 
output device. Program code may be applied to input data to 
perform the functions described herein and generate output 
information. The output information may be applied to one 
or more output devices. For purposes of this application, a 
processing system includes any system that has a processor, 
such as, for example, a digital signal processor (DSP), a 
microcontroller, an application specific integrated circuit 
(ASIC), or a microprocessor. 

The programs may be implemented in a high level procedural 
or object oriented programming language to communicate 
with a processing system. The programs may also 
be implemented in assembly or machine language. The 
invention is not limited in scope to any particular programming 
language. In any case, the language may be a compiled 
or interpreted language. 

The programs may be stored on a storage media or device 
(e.g., hard disk drive, floppy disk drive, read only memory 
(ROM), CD-ROM device, flash memory device, digital 
versatile disk (DVD) or other storage device) readable by a 
general or special purpose programmable processing 
system, for configuring and operating the processing system 
when the storage media or device is read by the processing 
system to perform the procedures described herein. 
Embodiments of the invention may also be considered to be 
implemented as a machine readable storage medium, configured 
for use with a processing system, where the storage medium 
so configured causes the processing system to operate in a 
specific and predefined manner to perform the function 
described herein. 

There are two possible computer systems of interest. The 
first system is called a "host". The host includes a compiler. 
The host system carries out the transformation of constructing 
and unrolling speculatively counted loops. The second 
system is called a "target". The target system executes the 
programs that were compiled by the host system. The host 
and target systems can have the same configuration in some 
embodiments. In the compiled program, speculatively 
counted loops can be present. Such a program would use the 
data speculation that is implemented in system hardware. 
The target computer system has to be one in which the 
processor has data speculation implemented. 

An example of one such processing system is shown in 
FIG. 1. Sample system 100 may be used, for example, to 
execute the processing for embodiments of a method of 
constructing and unrolling speculatively counted loops, in 
accordance with the present invention, such as the 
embodiment described herein. Sample system 100 is representative 
of processing systems based on the PENTIUM®, 
PENTIUM® Pro, and PENTIUM® II microprocessors available 
from Intel Corporation, although other systems (including 
personal computers (PCs) having other microprocessors, 
engineering workstations, set-top boxes and the like) may 
also be used. In one embodiment, sample system 100 may be 
executing a version of the WINDOWS™ operating system 
available from Microsoft Corporation, although other 
operating systems and graphical user interfaces, for example, 
may also be used. 

FIG. 1 is a block diagram of a system 100 of one 
embodiment of the present invention. System 100 can be a 
host or target machine. The computer system 100 includes a 
processor 102 that processes data signals. The processor 102 
may be a complex instruction set computer (CISC) 
microprocessor, a reduced instruction set computing (RISC) 
microprocessor, a very long instruction word (VLIW) 
microprocessor, a processor implementing a combination of 
instruction sets, or other processor device, such as a digital 
signal processor, for example. FIG. 1 shows an example of 
an embodiment of the present invention implemented as a 
single processor system 100. However, it is understood that 
embodiments of the present invention may alternatively be 
implemented as systems having multiple processors. 
Processor 102 may be coupled to a processor bus 110 that 
transmits data signals between processor 102 and other 
components in the system 100. 

System 100 includes a memory 116. Memory 116 may be 
a dynamic random access memory (DRAM) device, a static 
random access memory (SRAM) device, or other memory 
device. Memory 116 may store instructions and/or data 
represented by data signals that may be executed by 
processor 102. The instructions and/or data may comprise code 
for performing any and/or all of the techniques of the present 
invention. A compiler for constructing and unrolling 
speculatively counted loops can be residing in memory 116 during 
code compilation. Memory 116 may also contain additional 
software and/or data not shown. A cache memory 104 may 
reside inside processor 102 that stores data signals stored in 
memory 116. Cache memory 104 in this embodiment speeds 
up memory accesses by the processor by taking advantage of 
its locality of access. Alternatively, in another embodiment, 
the cache memory may reside external to the processor. 

A bridge/memory controller 114 may be coupled to the 
processor bus 110 and memory 116. The bridge/memory 
controller 114 directs data signals between processor 102, 
memory 116, and other components in the system 100 and 
bridges the data signals between processor bus 110, memory 
116, and a first input/output (I/O) bus 120. In some 
embodiments, the bridge/memory controller provides a 
graphics port for coupling to a graphics controller 112. In 
this embodiment, graphics controller 112 interfaces to a 
display device for displaying images rendered or otherwise 
processed by the graphics controller 112 to a user. The 
display device may comprise a television set, a computer 
monitor, a flat panel display, or other suitable display device. 

First I/O bus 120 may comprise a single bus or a 
combination of multiple buses. First I/O bus 120 provides 
communication links between components in system 100. A 
network controller 122 may be coupled to the first I/O bus 
120. The network controller links system 100 to a network 
that may include a plurality of processing systems and 
supports communication among various systems. The 
network of processing systems may comprise a local area 
network (LAN), a wide area network (WAN), the Internet, 
or other network. A compiler for constructing and unrolling 
speculatively counted loops can be transferred from one 
computer to another system through a network. Similarly, 
compiled code that has been optimized by a method of 
constructing and unrolling speculatively counted loops can 
be transferred from a host machine to a target machine. In 
some embodiments, a display device controller 124 may be 
coupled to the first I/O bus 120. The display device 
controller 124 allows coupling of a display device to system 100 
and acts as an interface between a display device and the 
system. The display device may comprise a television set, a 
computer monitor, a flat panel display, or other suitable 
display device. The display device receives data signals 
from processor 102 through display device controller 124 
and displays information contained in the data signals to a 
user of system 100. 

In some embodiments, camera 128 may be coupled to the 
first I/O bus to capture live events. Camera 128 may 
comprise a digital video camera having internal digital video 
capture hardware that translates a captured image into digital 
graphical data. The camera may comprise an analog video 
camera having digital video capture hardware external to the 
video camera for digitizing a captured image. Alternatively, 
camera 128 may comprise a digital still camera or an analog 
still camera coupled to image capture hardware. A second 
I/O bus 130 may comprise a single bus or a combination of 
multiple buses. The second I/O bus 130 provides 
communication links between components in system 100. A data 
storage device 132 may be coupled to second I/O bus 130. 
The data storage device 132 may comprise a hard disk drive, 
a floppy disk drive, a CD-ROM device, a flash memory 
device, or other mass storage device. Data storage device 
132 may comprise one or a plurality of the described data 
storage devices. The data storage device 132 of a host 
machine can store a compiler for constructing and unrolling 
speculatively counted loops. Similarly, code that has been 
optimized with a method for constructing and unrolling 
speculatively counted loops can be stored in the data storage 
device 132 of a target machine. 

A keyboard interface 134 may be coupled to the second 
I/O bus 130. Keyboard interface 134 may comprise a 
keyboard controller or other keyboard interface device. 
Keyboard interface 134 may comprise a dedicated device or 
may reside in another device such as a bus controller or other 
controller device. Keyboard interface 134 allows coupling 
of a keyboard to system 100 and transmits data signals from 
a keyboard to system 100. A user input interface 136 may be 
coupled to the second I/O bus 130. The user input interface 
may be coupled to a user input device, such as a mouse, 
joystick, or trackball, for example, to provide input data to 
the computer system. 

Audio controller 138 may be coupled to the second I/O 
bus 130. Audio controller 138 operates to coordinate the 
recording and playback of audio signals. A bus bridge 126 
couples first I/O bus 120 to second I/O bus 130. The bus 
bridge 126 operates to buffer and bridge data signals between 
the first I/O bus 120 and the second I/O bus 130. 

Embodiments of the present invention are related to the 
use of the system 100 for constructing and unrolling specu- 
latively counted loops. According to one embodiment, such 
processing may be performed by the system 100 in response 
to processor 102 executing sequences of instructions in 
memory 116. Such instructions may be read into memory 
116 from another computer-readable medium, such as data 
storage device 132, or from another source via the network 
controller 122, for example. Execution of the sequences of 
instructions causes processor 102 to construct and unroll 
speculatively counted loops according to embodiments of 
the present invention. In an alternative embodiment, hard- 
ware circuitry may be used in place of or in combination 
with software instructions to implement embodiments of the 
present invention. Thus, the present invention is not limited 
to any specific combination of hardware circuitry and soft- 
ware. 

The elements of system 100 perform their conventional 
functions well-known in the art. In particular, data storage 
device 132 may be used to provide long-term storage for the 
executable instructions and data structures for embodiments 
of methods of constructing and unrolling speculatively 
counted loops in accordance with the present invention, 
whereas memory 116 is used to store on a shorter term basis 
the executable instructions of embodiments of the methods 
of constructing and unrolling speculatively counted loops in 
accordance with the present invention during execution by 
processor 102. 

Although the above example describes the distribution of 
computer code via a data storage device, program code may 
be distributed by way of other computer readable mediums. 
For instance, a computer program may be distributed 
through a computer readable medium such as a floppy disk, 
a CD ROM, a carrier wave, a network, or even a transmis- 
sion over the internet. Software code compilers often use 
optimizations during the code compilation process in an 
attempt to generate faster and better code. Loop unrolling is 
one optimization that may be applied when code is com- 
piled. An example of a typical loop may be: 
Loop { . . . 
    B(i) 
    i=i+1 
    test(i) 
    exit 
. . . } 

There is normally some control overhead such as 'test(i)' in 
the above example to control the number of loop iterations. 
In loop unrolling, the loop body is copied multiple times. 
The above loop may be unrolled to become: 
Loop { . . . 
    B(i) 
    B(i+1) 
    B(i+2) 
    B(i+3) 
    i=i+4 
    test(i) 
    exit 
. . . } 

The original loop body B(i) has been copied three times and 
the control variable 'i' incremented accordingly. By unrolling 
the loop, the branch instruction and test for loop exit 
execute one fourth as often as in the original loop. 
Furthermore, fewer instructions are needed for the control 
flow and more instructions are grouped together into a block. 
A large contiguous block of code may also allow for 
subsequent code optimizations. 

Loop unrolling reduces the overhead of executing an 
indexed loop and may improve the effectiveness of other 
optimizations, such as common subexpression elimination, 
induction-variable optimizations, instruction scheduling, 
and software pipelining. Loop unrolling generally increases 
the available instruction-level parallelism, especially if 
several other transformations are performed on the copies of the 
loop body to remove unnecessary dependencies. Thus, 
unrolling has the potential of significant benefit for many 
implementations, particularly superscalar, VLIW, and 
Explicitly Parallel Instruction Computing (EPIC) ones. 

Loop unrolling may also provide other advantages. For 
instance, instruction scheduling or prefetching in some 
computer architectures may benefit from loop unrolling. 
Loop unrolling is often used to enable the generation of data 
prefetch instructions. When a compiler inserts data prefetch 
instructions into loops, the compiler may need to insert those 
instructions into only some iterations of the loop. Unrolling 
the loop makes several iterations explicitly available to the 
compiler such that the compiler can insert instructions into 
some but not all of the iterations. In some instances, unrolled 
loops may utilize cache memory more efficiently. 
Furthermore, not taken branches may be cheaper in terms of 
performance loss than taken branches. If the compiler 
predicts that the loop will execute many iterations, then a larger 
block of code may be cached in memory and fewer jumps or 
branches will be executed. 
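
The point about inserting prefetch instructions into only some iterations can be illustrated with a C sketch. It assumes a GCC or Clang compiler, whose __builtin_prefetch builtin is used here; on other compilers the PREFETCH macro expands to a no-op. The prefetch distance PREFETCH_AHEAD and the function name are hypothetical, and the sketch assumes that one prefetch per four elements is enough to cover a cache line.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))     /* no prefetch support assumed */
#endif

/* Hypothetical prefetch distance, in array elements. */
#define PREFETCH_AHEAD 64

/* Sums a[0..n-1]; unrolled by four with one prefetch per unrolled
 * iteration instead of one per original iteration. */
long sum_with_prefetch(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Only prefetch addresses that are inside the array. */
        if (i + PREFETCH_AHEAD < n)
            PREFETCH(&a[i + PREFETCH_AHEAD]);
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Cleanup loop for the remaining iterations. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

Without unrolling, the compiler would have to either prefetch in every iteration or add a test on the induction variable to skip most of them.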

When the compiler is analyzing the program code, a loop 
may be completely removed and replaced with a contiguous 
block of code if the number of loop iterations is small. 
Similarly, the number of loop iterations may be reduced 
through loop unrolling if the number of iterations is large. 

Referring now to FIGS. 2A and 2B, there are two 
examples of 'for' loops. FIG. 2A illustrates a normal 'for' 
loop before unrolling. FIG. 2B illustrates the 'for' loop of 
FIG. 2A after the loop unrolling transformation. The unrolling 
transformation in this example has been oversimplified. 
For this example, the loop bounds are known constants and 
the unrolling factor divides evenly into the number of 
iterations. However, such conditions are generally not 
satisfied and the compiler has to keep a cleanup copy of the 
loop. When the number of iterations remaining in a loop is 
less than the unrolling factor, the unrolled copy is exited and 
the cleanup copy is executed to complete the remaining 
iterations. This approach also reduces the number of early 
termination tests and conditional control flow between 
copies of the body in some loops. 

When the compiler unrolled the loop of FIG. 2A by a 
factor of two, the loop body "s:=s+a[i]" was copied twice 
and the loop counter 'i' adjusted as shown in FIG. 2B. The 
unrolled loop executes both the loop-closing test and the 
branch half as many times as the original loop. Hence loop 
unrolling optimization may positively affect system 
performance. In the present example, loop unrolling increases the 
effectiveness of instruction scheduling by making two loads 
of 'a[i]' values available to be scheduled in each loop 
iteration. 

Loops are generally distinguished into two classifications: 
counted loops and loops with a data dependent exit. A loop 
is counted if the number of iterations that the loop will 
execute is determined once execution reaches the loop. 
However, if the number of loop iterations is not determined 
once execution reaches the loop, and is determined during 
the execution of the loop, then the loop will be classified as 
a while loop. The number of iterations that a while loop 
executes may be determined during the execution of the loop 
or on the fly. 

Counted loops are further distinguished between two 
kinds. The first kind of loop has a constant number of 
iterations known at compile time. For example, the header of 
a loop may be "for (i=0; i<200; i++)". The compiler will be 
able to determine that the number of loop iterations will be 
two hundred. The loop may be unrolled, but the loop body 
may not necessarily be copied two hundred times. Instead, 
the unrolling factor may be a smaller number that divides 
into two hundred. The loop body may be copied ten times 
and the new loop executed twenty times. In one 
embodiment, the loop unrolling factor is chosen such that 
the loop count is divided evenly. 

The second type of counted loop is variable. The defining 
property of a variable counted loop is that the compiler cannot 
determine the number of loop iterations at compile time. The 
inability to determine the number of iterations can be due to a 
variety of reasons. One example is the case where the number 
of loop iterations is read by the program from an input file. A 
function call is typically such a barrier to analysis, although 
compilers do perform inter function analysis. However, a 
function call is not the only reason why a compiler is unable 
to figure out the number of loop iterations. 

In a variable counted loop, the number of loop iterations 
can be the result of a function call. The function call 
provides a number to be used as the loop upper bound. As 
a result, the compiler will not know at compile time how 
many times the loop will execute since the loop count may 
be different each time the loop is executed. So even though 
the loop is a counted loop, the trip count is unknown or not 
a compile time constant. 

In addition to counted loops and "while" loops, a third 
class of loops is introduced. This third class comprises 
speculatively counted loops. A speculatively counted loop 
satisfies all the requirements of a counted loop except for the 
characteristic that a speculatively counted loop has a loop 
upper bound that has not been proven to be loop invariant. 
Without the ability to classify loops as speculatively 
counted, these loops would have to be considered "while" 
loops. The compiler can transform a "while" loop into a 
speculatively counted loop by: (1) inserting an advanced 
load of the upper bound into a register; and (2) inserting an 
advanced load check before the loop termination test. 
Various optimizations such as software pipelining, whose 
effectiveness depends on classification of loops, may benefit from 
being able to transform "while" loops into speculatively 
counted loops. The following embodiments only demonstrate 
the way the loop unrolling optimization benefits from this 
capability. The description of speculatively counted loops 
and methods of constructing speculatively counted loops are 
presented within the context of loop unrolling. 

Knowledge that a loop is a counted loop allows the
compiler the opportunity to further optimize the loop in
ways that may not be available otherwise. One such
optimization may be loop unrolling where a loop is unrolled 'n'
times, such that 'n-1' additional copies of the loop body are
made. When the unrolled loop is a counted loop, there is no
need to test for the exit condition inside the unrolled body.
But if the loop is a data dependent or while loop, then the
exit condition needs to be tested after each loop body. Because
the exit condition is tested only once during each iteration of
the unrolled counted loop, 'n-1' tests are saved per iteration.

In order to classify a loop as a counted loop, the compiler
has to prove a number of conditions. These conditions may
include identifying characteristics such as a linear induction
variable and a loop invariant variable that serves as an upper
bound for the loop in particular. If the compiler cannot prove
that the upper bound is loop invariant, then the loop cannot
be classified as a counted loop. One of the most common
limitations to proving that a variable is loop invariant is
showing that no memory store that executes inside
the loop body can change the value of the loop upper bound.
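
As an illustration of why stores inside the loop body matter, the following C sketch runs the same loop twice: once with a pointer that targets an unrelated variable, and once with a pointer that aliases the upper bound. The function names (count_trips, trips_without_alias, trips_with_alias) and the stored value 4 are assumptions chosen for the example, not part of the described embodiments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: runs a loop whose body stores 4 through p and
 * returns how many times the body executed. The bound is re-read
 * through a pointer on every test, so the aliasing case is well
 * defined in C. */
static int count_trips(int *bound, int *p)
{
    assert(bound != NULL && p != NULL);
    int trips = 0;
    for (int i = 0; i < *bound; i++) {
        *p = 4;      /* changes the bound only if p aliases it */
        trips++;
    }
    return trips;
}

/* p points at an unrelated variable: the bound stays at 10. */
int trips_without_alias(void)
{
    int n = 10, other = 0;
    return count_trips(&n, &other);
}

/* p points at the bound itself: the first store lowers it to 4. */
int trips_with_alias(void)
{
    int n = 10;
    return count_trips(&n, &n);
}
```

Unless the compiler can rule out the second case, it cannot treat the bound as loop invariant, which is the limitation described above.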

Unrolling a data dependent loop such as the following
while loop may generate less efficient code:

while (a[i] != 0) do {. . . B(i) . . . }

The compiler first copies the loop body a number of times. 
In this example, the compiler is designed to unroll all loops 
by a factor of four. Next, the compiler has to insert multiple 
exit tests to check for termination between every loop body. 
Similarly, if the loop had a variant upper bound, test state- 
ments would be needed to ensure that the upper bound had 
not changed. 
Loop {. . .
  B(i)
  if (a[i]==0) goto Exit
  B(i+1)
  if (a[i+1]==0) goto Exit
  B(i+2)
  if (a[i+2]==0) goto Exit
  B(i+3)
  if (a[i+3]==0) goto Exit
  else goto Loop
Exit . . . }
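
A runnable C sketch of the same idea is given below, under some assumptions: the body B(i) is replaced by a simple accumulation, the array is terminated by a zero element, and the function names (sum_rolled, sum_unrolled4, demo_sums_match) are hypothetical. Unlike the listing above, this sketch places each exit test before the body copy it guards so that the unrolled form stays equivalent to the rolled while loop.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: accumulate a[i] until a zero element is reached. */
int sum_rolled(const int *a)
{
    assert(a != NULL);
    int sum = 0;
    size_t i = 0;
    while (a[i] != 0) {
        sum += a[i];            /* stands in for B(i) */
        i++;
    }
    return sum;
}

/* Unrolled by four: the data dependent exit test must be repeated
 * before every copy of the body, so no tests are saved. */
int sum_unrolled4(const int *a)
{
    assert(a != NULL);
    int sum = 0;
    size_t i = 0;
    for (;;) {
        if (a[i] == 0) break;
        sum += a[i];            /* B(i)   */
        if (a[i + 1] == 0) break;
        sum += a[i + 1];        /* B(i+1) */
        if (a[i + 2] == 0) break;
        sum += a[i + 2];        /* B(i+2) */
        if (a[i + 3] == 0) break;
        sum += a[i + 3];        /* B(i+3) */
        i += 4;
    }
    return sum;
}

/* Returns 1 when both forms agree on a small zero-terminated array. */
int demo_sums_match(void)
{
    static const int data[] = {3, 1, 4, 1, 5, 9, 0};
    return sum_rolled(data) == sum_unrolled4(data);
}
```

The unrolled version still performs one test per body copy, which is why unrolling a data dependent loop gains little compared with unrolling a counted loop.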

On the other hand, some counted loops may need the exit 
condition tested only once. In a counted loop, the original 
loop body is simply replaced with four copies of the loop
body.

But unrolling a loop having an indeterminate number of
iterations is more complicated. The compiler may attempt to
unroll the loop even though the value of the loop count 'n'
is unknown. If 'n' turns out to be two and the compiler had
copied the loop body four times, then the program code will be
wrong since the loop body will be executed four times before
the exit condition is tested. Another issue in loop unrolling is
that the trip count may not be evenly divisible. For instance,
a trip count of seventeen or nineteen is not evenly divisible
by an unrolling factor of four. As a result, a cleanup loop
simply comprising the original loop with one loop body may be
inserted after the unrolled loop. In the example loop having
a trip count of seventeen, the processor may execute the
unrolled loop with the four copies four times and the cleanup
loop once for a total of seventeen iterations.
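
The trip count of seventeen can be worked through in C as a minimal sketch. The function names (sum_counted, demo_trip17), the UNROLL macro, and the pass counters are assumptions added for illustration; the body is again a simple accumulation.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 4

/* Sums n elements with the loop unrolled by UNROLL, followed by a
 * cleanup loop. The two counters report how many passes each loop
 * made. */
long sum_counted(const int *a, size_t n,
                 size_t *unrolled_passes, size_t *cleanup_passes)
{
    assert(a != NULL && unrolled_passes != NULL && cleanup_passes != NULL);
    long sum = 0;
    size_t i = 0;
    *unrolled_passes = 0;
    *cleanup_passes = 0;

    /* Unrolled loop: one exit test per four copies of the body. */
    for (; i + UNROLL <= n; i += UNROLL) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
        (*unrolled_passes)++;
    }

    /* Cleanup loop: the original loop with a single body handles the
     * iterations left over when n is not a multiple of UNROLL. */
    for (; i < n; i++) {
        sum += a[i];
        (*cleanup_passes)++;
    }
    return sum;
}

/* Trip count 17: expects four unrolled passes and one cleanup pass. */
int demo_trip17(void)
{
    int a[17];
    size_t unrolled = 0, cleanup = 0;
    for (int k = 0; k < 17; k++)
        a[k] = k + 1;
    long s = sum_counted(a, 17, &unrolled, &cleanup);
    return s == 153 && unrolled == 4 && cleanup == 1;
}
```

Four passes of four bodies cover sixteen iterations and the cleanup loop covers the seventeenth, matching the count given above.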

Some important issues in loop unrolling are deciding
which loops to unroll and by what factor. The concerns




involved are architectural characteristics and the selection of
particular loops in a particular program to unroll and the
unrolling factors to use for them. Architectural characteristics
include factors such as the number of registers available,
the available overlap among floating-point operations and
memory references, and the size and organization of the
instruction cache. The impact of some architectural
characteristics is often determined heuristically by
experimentation. As a result, unrolling decisions for individual
loops can benefit significantly based on feedback from profiled
runs of the program.

The results of such experimentation may also depend on
the presence of the following loop characteristics: (1) the
presence of only a single basic block or straight-line code;
(2) a balance of floating-point and memory operations or a
certain balance of integer memory operations; (3) small
number of intermediate-code instructions; and (4) loops
having simple loop control. The first and second criteria
restrict loop unrolling to loops that are most likely to benefit
from instruction scheduling. The third characteristic
attempts to keep the unrolled blocks of code short so that
cache performance is not adversely impacted. The last
criterion keeps the compiler from unrolling loops for which
it is difficult to determine when to take the early exit to the
unrolled copy for the final iterations, such as when traversing
a linked list. In one embodiment of the invention, the
unrolling factor may be anywhere from two on up, depending
on the specific contents of the loop body. Furthermore,
the unrolling factor of one embodiment will usually not be
more than four and almost never more than eight. However,
further development of VLIW or EPIC machines may provide
good use for larger unrolling factors.

In one embodiment, the number of copies made of the
loop body is determined heuristically. In another
embodiment, the compiler may provide the programmer
with a compiler option to specify which loops to unroll and
what factors to unroll them by. A performance tradeoff exists
depending on how many times the loop body is copied. One
factor involved is the size of the instruction cache. Code
performance may be impacted if a loop body is copied too
many times since the block of new code may not fit into the
instruction cache. A programmer may want to grow the
number of instructions in a loop body so that the computer
has a larger contiguous block of code to execute. However,
the body of instructions should fit into the instruction cache
or else a performance hit may occur. Hence, the programmer
may start initially with a loop that originally fits in the
instruction cache, but end up with a large block of
instructions that no longer fits into the cache.

In the present invention, a new classification of loops
called speculatively counted loops is created. Speculatively
counted loops have generally been classified as data dependent
exit loops and hence, not optimized as a counted loop.
Speculatively counted loops have a construct similar to that
of counted loops, but some speculatively counted loops may
have stores to memory that cannot be disambiguated from
the loop upper bound. Hence, the reason the compiler did not
classify the loop as a counted loop was because the loop
upper bound could not be disambiguated. One example
where the compiler cannot prove that the upper bound is
loop invariant is in a loop involving pointers to arrays. A
speculatively counted loop would have been classified as a
counted loop if the loop upper bound had been disambiguated
from all memory stores in the loop. Hence, the reason
the compiler did not classify the loop as a counted loop was
because the loop upper bound could not be disambiguated.
In one embodiment, the process of classifying a loop as
speculatively counted is performed in a procedure that is
similar to the process used to classify loops as a counted
loop.

Another problem encountered with loop unrolling is that
the value of the trip count 'n' cannot change within the loop.
Proving that the trip count is constant may be a difficult task
for some compilers. For example, the program may have a
pointer that points to an integer value. Depending on the
program language, a pointer may generally be assigned a
value anywhere within the program, including somewhere
inside a loop. Furthermore, pointers may be dynamic and
the length may change. If the compiler is unable to prove
that none of the memory stores inside the loop change the
value of 'n', then the processor may not execute four
iterations of the loop body consecutively without testing for
loop termination between each body. For example, a loop
may look like:

n=10
p=address n
for (i=0; i<n; i++) {. . .
  x=y+z
  *p=4
. . . }

The header of the above loop is "for (i=0; i<n; i++)" where
the loop count 'n' may be a variable dynamically defined by
a function call or 'n' may be referenced by a pointer '*p' or
modified within the loop body. The ambiguity introduced by
pointer *p prevents loop optimization in conventional
compilers. Since the upper bound 'n' is not known at compile
time or may change within the loop, the compiler will
consider the loop as having an unknown upper bound. Hence
the loop would be treated like a while loop. The present
invention may allow for the transformation of loops that
look like counted loops, but have loop upper bounds that
cannot be proven as loop invariant.

A statement in a computer program is said to define a
variable if it assigns, or may assign, a value to that variable.
For example, the statement "x=y+z" is said to define 'x'. A
statement that defines a variable contains a definition of that
variable. In this context there are two types of variable
definitions: unambiguous definitions and ambiguous
definitions. Ambiguous definitions may also be called complex
definitions. When a definition always defines the same
variable, the definition is said to be an unambiguous
definition of that variable. For example, the statement, "x=y"
always assigns the value of 'y' to 'x'. Such a statement
always defines the variable 'x' with the value of 'y'. Thus,
the statement "x=y" is an unambiguous definition of 'x'. If
all definitions of a variable in a particular segment of code
are unambiguous definitions, then the variable is known as
an unambiguous variable.

Some definitions do not always define the same variable
and may possibly define different variables at different times
in a computer program. Thus they are called ambiguous
definitions. There are many types of ambiguous definitions.
One type of ambiguous definition occurs where a pointer
refers to a variable. For example, the statement "*p=y" may
be a definition of 'x' since it is possible that the pointer 'p'
points to 'x'. Hence, the above ambiguous definition may
ambiguously define any variable 'x' if it is possible that 'p'
points to 'x'. In other words, '*p' may define one of several
variables depending on the addressed value of 'p'. Another
type of ambiguous definition is a call of a procedure with a
variable passed by reference. When a variable is passed by
reference, the address of the variable is passed to the
procedure. Passing a variable by reference to a procedure
allows the procedure to modify the variable. Alternatively,




variables may be passed by value. Only the value of the
variable is passed to the procedure when a variable is passed
by value. Passing a variable by value does not allow the
procedure to modify the variable. Another type of ambiguous
definition is a procedure that may access a variable
because that variable is within the scope of the procedure.
Yet another type of ambiguous definition occurs when a
variable is not within the scope of a procedure but the
variable has been identified with another variable that is
passed as a parameter or is within the scope of the procedure.

When the compiler unrolls a loop having a data dependent
exit, the compiler makes copies of the loop body 'i' and the
exit test. The exit test allows the processor to take side exits
out of the loop during program execution. Data dependent
exit loops are generally tested for loop termination between
each copy of the body. If the loop has to terminate, then the
processor has to go to a loop exit. If the exit condition tests
true, then the loop has to terminate and the program goes to
a loop exit. If the condition is false, then the next loop body
'i+1' is executed and the test for loop termination performed
again.

One advantage of the present invention may be the
omission of the exit condition test between copies of the
loop body. However, the compiler needs to determine
whether the speculatively counted loop may be correctly
treated as a counted loop. If the compiler cannot make such
a determination, then the tests for loop termination and side
exits are kept in the loop. One criterion in determining if a
speculatively counted loop may be treated like a counted
loop is whether the loop upper bound is loop invariant. In
order to prove that the upper bound is truly loop invariant,
the compiler needs to analyze the stores that occur inside the
loop. The process of proving that two memory operands are
different is called memory disambiguation.

Data speculation occurs when a later load is scheduled
above an earlier store and the compiler cannot verify that the
load and store will never access overlapping areas of
memory. The process of determining whether loads and
stores access overlapping areas of memory is termed
"disambiguation." A load-store pair for which the compiler
cannot guarantee that the load and store will never access
overlapping areas of memory is termed "un-disambiguated."
In the following text, the phrase "un-disambiguated store"
will be used to refer to the store in an un-disambiguated
load-store pair. A store cannot be un-disambiguated by
itself, but only in the context of a particular load.

Compilers often perform memory disambiguation to
prove that a loop upper bound is loop invariant. Sometimes,
two memory operands may appear to be different, but the
compiler is unable to verify that the two operands are indeed
different. Memory disambiguation attempts to verify that
two variables are not the same and are not affected by
changes to the other. In one embodiment, the processor may
include a special construct to assist compilers in the task of
memory disambiguation. One special construct for memory
disambiguation is the advance load or data speculative load.
For example, a program has stored a piece of data at memory
location X. At some later point in the program, a piece of
data is loaded from memory location Y. If the compiler tries
to schedule the memory load before the memory store, the
resulting program is legal only if locations X and Y are
different memory locations. If the compiler can prove that
memory locations X and Y are indeed different, then the
compiler can switch the order of the store and load
instructions. But if locations X and Y are the same memory
location and the order of the store and load instructions is
switched, then the memory load would be fetching the wrong
data since the correct data has not yet been stored at the
memory location. If the variable that stores the loop upper
bound cannot be disambiguated from all the memory stores in
the loop, then that value has to be read/reloaded from memory
prior to each comparison, and the loop cannot be treated as a
counted loop.

The compiler may use the advance load construct in
situations where the compiler cannot verify that the memory
locations are different. The present invention may be used
with counted loops that have upper bounds that cannot be
disambiguated from memory stores within the loop body.
One example may be a loop that contains a pointer into a
large array. The compiler may not be able to verify that the
pointer does not change the loop upper bound. In one
embodiment, the advance load (ld.a) and advance load check
(chk.a) instructions interact with a hardware structure called
the advanced load address table (ALAT). The advanced load
instruction causes the processor to perform a load from a
memory location and write the memory address into an
ALAT. The ALAT acts as a cache of the physical memory
address and the physical register address accessed by the
most recently executed advanced loads. The size and
configuration of the ALAT is implementation dependent. A
straightforward implementation of one embodiment may
have entries containing a physical memory address field, an
access size field, a register address field, and a register type
field (general or floating-point). Using the target register
address and type as an index, advanced loads allocate a new
entry in the ALAT containing the physical address and size
of the region of memory being read.

During each memory store, the processor scans the ALAT
for any entries having the same memory address. Store
instructions would cause the processor to search all entries
in the ALAT using the physical address and size of the
region of memory being written. All entries corresponding
to overlapping regions of memory are invalidated. Advanced
load checks access the ALAT using the target register
address and type as an index. If the corresponding ALAT
entry is not valid, then either a store subsequent to the
advanced load accessed an overlapping area of memory or
the advanced load's entry has been replaced. The advanced
load check then performs the normal load operation for
memory access corresponding to the invalid ALAT entry.
But if the ALAT entry accessed by the advanced load check
is valid, then the advanced load had received correct data
and the advanced load check performs no action.

One embodiment uses "advanced loads" or "data speculative
loads" to handle un-disambiguated memory load-store
pairs. Support for data speculation may take the form of the
advance load (ld.a) and advance load check (chk.a)
instructions. A memory load that is statically scheduled above
an earlier store when the pair are un-disambiguated is
converted into an advanced load. However, if the load-store pair
can be disambiguated then the load does not need to be
converted into an advanced load. When the compiler converts
a particular load into an advanced load, a corresponding
advanced load check is scheduled at a point below the
lowest un-disambiguated store in the originating basic block
of the advanced load. Thus the advanced load and advanced
load check instructions bracket one or more
un-disambiguated stores. The advanced load check should
be configured to perform the same memory access in both
address and size, and write the same destination register as
the advanced load.

The advance load check construct is related to the
advance load. In one embodiment, the compiler will insert




an advanced load check instruction between copies of the
loop body in the unrolled loop. The advanced load check
statement may be inserted just prior to the statement that
uses the advance loaded data. The advanced load check
instruction directs the processor to check the ALAT for a
specific memory address. The advanced load check
instruction checks to see if the advance loaded data has been
modified by a memory store. If the data has been changed,
then the data has to be reloaded. In one embodiment, a failed
check indicates that the data advance loaded from the given
memory location has been superseded with more recently
stored data. If the desired memory address is missing from
the ALAT or if the entry has been invalidated, then the check
has failed and a memory load needs to be executed again for
the specified memory location. In either situation of the
missing ALAT entry or invalidated ALAT entry, a new
memory load is performed so that the instruction requesting
the desired data will be using the correct result. Hence by
using the advanced load and advanced load check constructs
in a program, the compiler can change the order of the loads
and stores without causing the program to function
incorrectly.

Referring to FIGS. 3A and 3B, use of the advance load
and advanced load check is illustrated. FIG. 3A illustrates
a load-store pair in a code stream. The "store x=R2"
instruction represents a memory store of the contents of register
'R2' to memory operand 'x'. The "R3=load y" instruction
represents a memory load of memory operand 'y' to register
'R3'. FIG. 3B illustrates the code after an advance load. In
FIG. 3A, the "R3=load y" instruction may be moved above
the "store x=R2" only if memory operands 'y' and 'x' are
different. Otherwise, the move would be illegal. The
advanced load check (chk.a) is used when a memory load is
moved earlier in the instruction stream for advance loading.
The memory load of FIG. 3A has been moved earlier in the
instruction stream and modified to become an advance load
as illustrated by "R3=ld.a y" in FIG. 3B. Correspondingly,
an advanced load check "chk.a R3" has been inserted at the
original location of the memory load and just prior to the use
of register 'R3'. If the advance loaded value of register 'R3'
from memory operand 'y' has been modified before the
advanced load check, then the memory load needs to occur
again in order to correct the changes. If instructions that are
data dependent upon the advanced load are not scheduled
above an un-disambiguated store, then only the memory
load instruction needs to be re-executed in the event of an
overlap between the advanced load and a memory store.
This operation is the function of the advanced load check.
However, if one or more instructions dependent upon the
advance load are scheduled above an un-disambiguated
store, then in the event of an overlap all of these rescheduled
instructions need to be re-executed in addition to the
memory load.

The chk.a instruction is used to determine whether certain
instructions need to be re-executed. The compiler can use
the advance load check (chk.a) if other instructions are also
moved before the memory store. The advance load check
branches the execution to another address for recovery if the
check fails. The advance load check (chk.a) instruction of
one embodiment has two operands. One operand is the
register containing the data loaded by advance load. The
second operand is the address of the recovery block. The
recovery block can be simple and just branch to the cleanup
loop in one embodiment. If the chk.a cannot find a valid
ALAT entry for the advance load, then the program branches
to a recovery routine in an attempt to fix any mistakes made
by using the wrongly loaded data. The chk.a instruction
specifies a target register that needs to have the same address
and type as the corresponding advanced load. In the event of
an invalid entry in the ALAT, program control is transferred
to a recovery block. The recovery block contains code that
comprises a copy of the advanced load in non-speculative
form and all of the dependent instructions prior to the chk.a.
After completion of the recovery code, the program resumes
normal execution. However, the point at which normal
execution resumes is not predefined. The recovery block has
to end with a branch instruction to redirect execution at a
continuation point in the main thread of execution. One goal
of the recovery block is to maintain program correctness. If a
memory store in the loop body is changing the value of the
upper bound, the recovery block or cleanup loop may also
revert the loop back to its original form and simply iterate
one loop body at a time.

The present invention discloses a method to optimize a
speculatively counted loop. Unrolling speculatively counted
loops is similar to unrolling counted and while loops. When
a speculatively counted loop is unrolled, the loop body is
copied 'n-1' times. The compiler also adds a statement into
the preheader of the loop to perform a data speculative load
of the loop upper bound from memory into a register. In
another embodiment, the statement may be inserted at a
point that is outside of the loop. The data speculative load is
also referred to as an advanced load. Then between every
two loop bodies in the unrolled loop, the compiler inserts a
speculation check instruction. The check instruction of one
embodiment is an advance load check. The speculation
check is related to the advanced load that was added to the
preheader. A speculation check determines whether the
memory location that was speculatively loaded has been
changed by a subsequent store to memory. If the
speculatively loaded memory location has been changed, then
control is transferred to the recovery block.

FIGS. 4A, 4B, and 4C illustrate three different versions of
a loop. FIG. 4A illustrates the loop before the loop unrolling
transformation. FIG. 4B illustrates the loop of FIG. 4A if
unrolled by a factor of three as a 'while' loop. FIG. 4C
illustrates the loop of FIG. 4A unrolled three times as a
speculatively counted loop. The loop counter or control of
all three versions is represented by 'i' and the termination
count is represented by 'n'.

In one embodiment, the compiler may use the advance
load and advanced load check constructs in a program loop
if the only instruction relevant to the contents of a memory
address moved before the memory store is the memory load.
The compiler starts by generating an advance load of the
upper bound. The loop body may then be copied and the
count incremented accordingly. But instead of testing
between each loop body for loop termination as in a while
loop, the compiler generates an advanced load check that
corresponds to the target of the advance load. The compiler
also appends a cleanup loop having a single loop body to the
unrolled loop. A failed check would cause a recovery and
memory load to be performed so that the program execution
could continue correctly. Furthermore, the compiler may
also take certain instructions that use the value that was
advance loaded, such as an add instruction, and move those
instructions before the memory store in the code. The
method of constructing and unrolling speculatively counted
loops does not have to keep track of any specific store that
cannot be disambiguated from the load of an upper bound.
Once the load of the upper bound is not proven to be loop
invariant, there is no longer a need to keep track of a specific
store. There may be any number of such stores. However, if
the advanced load check fails, then the moved instructions




may have to be re-executed after the correct data is
loaded in order to maintain program correctness.
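
The interaction between an advanced load, an intervening store, and an advanced load check can be modeled in software. The following C sketch is only an illustration under stated assumptions: the table size, the struct layout, and the function names (advanced_load, store_int, check_advanced_load, demo_alat) are hypothetical, the table is indexed by a bare register number without a register type, and a real ALAT is a hardware structure whose organization is implementation dependent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALAT_ENTRIES 8          /* assumed size for this model only */

typedef struct {
    bool      valid;            /* cleared when an overlapping store occurs */
    uintptr_t addr;             /* start of the region that was read */
    size_t    size;             /* size of the region that was read */
} AlatEntry;

static AlatEntry alat[ALAT_ENTRIES];   /* indexed by target register */

/* Models ld.a: performs the load and allocates an entry for reg. */
int advanced_load(unsigned reg, const int *src)
{
    assert(reg < ALAT_ENTRIES && src != NULL);
    alat[reg].valid = true;
    alat[reg].addr  = (uintptr_t)src;
    alat[reg].size  = sizeof *src;
    return *src;
}

/* Models a store: writes the value, then invalidates every entry
 * whose region overlaps the region being written. */
void store_int(int *dst, int value)
{
    assert(dst != NULL);
    uintptr_t lo = (uintptr_t)dst;
    uintptr_t hi = lo + sizeof *dst;
    *dst = value;
    for (unsigned r = 0; r < ALAT_ENTRIES; r++) {
        if (alat[r].valid &&
            alat[r].addr < hi && lo < alat[r].addr + alat[r].size)
            alat[r].valid = false;
    }
}

/* Models the check: true means the advanced load is still good. */
bool check_advanced_load(unsigned reg)
{
    assert(reg < ALAT_ENTRIES);
    return alat[reg].valid;
}

/* A store to x leaves the entry for y valid; a store to y does not. */
int demo_alat(void)
{
    int x = 1, y = 2;
    int r3 = advanced_load(3, &y);
    store_int(&x, 7);
    bool ok_after_x = check_advanced_load(3);
    store_int(&y, 9);
    bool ok_after_y = check_advanced_load(3);
    if (!ok_after_y)
        r3 = y;                 /* recovery: repeat the load */
    return ok_after_x && !ok_after_y && r3 == 9;
}
```

In this model the caller performs the reload when the check fails; in the embodiments described here that work is done by the check instruction itself or by the recovery block it branches to.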

The function of the advanced load check in one embodiment
includes branching to a recovery block if the check
fails. During the loop unrolling transformation of one
embodiment, the compiler can generate code for the recovery
block that will re-execute all the instructions that were
moved in front of the memory store. The recovery block of
one embodiment may also branch to a cleanup loop. Once
the processor completes the recovery, the program may
direct the processor to branch from the end of the recovery
block back to a point in the program after the originating
advanced load check. The processor can then continue
program execution as before the check failed. Hence if the
advanced load check does not fail, the overhead is negligible
and the program may execute quickly. The compiler can
generate a recovery block by saving a copy of the original
loop. The recovery block contains code to perform a new
load of the memory location into a register and to transfer
loop control to a version of the loop that is identical to the
original version of the loop. The speculation check
instruction and the recovery block are measures to ensure
correct loop execution. In one embodiment, the recovery block
is not part of the actual loop and the check instruction is
comprised of one instruction. Hence, the performance of an
unrolled speculatively counted loop may approach that of
code generated for an unrolled counted loop.
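
The shape of an unrolled speculatively counted loop can be sketched in portable C, with important caveats. C has no advanced load or check instruction, so this sketch emulates the check by comparing the cached bound with memory, which is exactly the reload a real chk.a avoids. The function names (run_rolled, run_speculative3, demo_speculative), the unrolling factor of three, and the trivial loop body are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Original rolled loop: the bound is re-read before every iteration.
 * Returns the number of times the body executed. */
int run_rolled(int *bound, int *p, int store_val)
{
    assert(bound != NULL && p != NULL);
    int trips = 0;
    for (int i = 0; i < *bound; i++) {
        *p = store_val;         /* may or may not alias *bound */
        trips++;
    }
    return trips;
}

/* Unrolled by three as a speculatively counted loop. */
int run_speculative3(int *bound, int *p, int store_val)
{
    assert(bound != NULL && p != NULL);
    int n = *bound;             /* stands in for ld.a in the preheader */
    int i = 0, trips = 0;

    while (i + 3 <= n) {
        *p = store_val; trips++;                      /* body copy 1 */
        if (*bound != n) { i += 1; goto recovery; }   /* emulated check */
        *p = store_val; trips++;                      /* body copy 2 */
        if (*bound != n) { i += 2; goto recovery; }   /* emulated check */
        *p = store_val; trips++;                      /* body copy 3 */
        i += 3;
        if (*bound != n) goto recovery;   /* check before the loop test */
    }

recovery:
    /* Serves as both recovery block and cleanup loop: the original
     * rolled loop, which re-reads the bound on every iteration. */
    for (; i < *bound; i++) {
        *p = store_val;
        trips++;
    }
    return trips;
}

/* Both forms must agree whether or not p aliases the bound. */
int demo_speculative(void)
{
    int n = 10, other = 0, m1 = 10, m2 = 10;
    int ok_noalias = run_speculative3(&n, &other, 4) == 10;
    int ok_alias   = run_speculative3(&m1, &m1, 4) == run_rolled(&m2, &m2, 4);
    return ok_noalias && ok_alias;
}
```

When no store touches the bound, the unrolled path runs with only the check between bodies; when a store does change it, control falls back to the rolled loop, so the result matches the original loop in both cases.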

FIG. 5 is a flow diagram that illustrates steps for
constructing and unrolling speculatively counted loops in one
embodiment of the present invention. Software developers
may often decide to optimize computer programs in an attempt
to improve performance. One such code optimization
method may entail the steps as shown in FIG. 5. The
compiler parses the program code for loops at step 510.
When a loop is encountered at step 515, the compiler
determines whether the loop is a counted loop. If the loop is
a counted loop, then the compiler attempts to optimize the
loop as a counted loop at step 520. If the loop is found not
to be a counted loop, the compiler goes on to step 525 to
determine whether the loop is a speculatively counted loop.
If the loop is found not to be a speculatively counted loop,
the compiler attempts to optimize the loop as a
non-speculatively counted or "while" loop at step 530. When the
compiler has determined that a speculatively counted loop is
present, load instructions of upper bounds are located within
the loop at step 535. Advance loads are inserted at the loop
preheader at step 540. At step 545, the compiler generates
and adds a cleanup loop. The cleanup block and recovery
block in one embodiment may be identical or simply point
to the other block of code. Memory load instructions are
changed to advanced load check instructions at step 550.
The original loop body is unrolled at step 555. The unrolling
factor of one embodiment is determined heuristically. In
another embodiment, the unrolling factor may be user
specified or predetermined.
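
The classification decisions of steps 515 through 530 can be summarized as a small dispatch function. This C sketch is an assumption-laden illustration: the LoopInfo flags, the Strategy enumeration, and the function names (choose_strategy, demo_strategy) are hypothetical summaries of analyses a compiler would perform, not structures taken from the described embodiments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what loop analysis has established. */
typedef struct {
    bool has_linear_induction_var;  /* e.g. i advances by a fixed step */
    bool has_upper_bound_test;      /* exit test compares i to a bound */
    bool bound_proven_invariant;    /* no store in the loop can alter it */
} LoopInfo;

typedef enum {
    OPT_COUNTED,                    /* step 520 */
    OPT_SPECULATIVELY_COUNTED,      /* steps 535 through 555 */
    OPT_WHILE                       /* step 530 */
} Strategy;

/* Mirrors the decisions at steps 515 and 525 of FIG. 5. */
Strategy choose_strategy(const LoopInfo *loop)
{
    assert(loop != NULL);
    bool counted_shape = loop->has_linear_induction_var &&
                         loop->has_upper_bound_test;

    if (counted_shape && loop->bound_proven_invariant)
        return OPT_COUNTED;
    if (counted_shape)
        return OPT_SPECULATIVELY_COUNTED;
    return OPT_WHILE;
}

/* Classifies one loop of each kind; returns 1 if all three match. */
int demo_strategy(void)
{
    LoopInfo counted     = { true,  true,  true  };
    LoopInfo speculative = { true,  true,  false };
    LoopInfo data_dep    = { false, false, false };
    return choose_strategy(&counted)     == OPT_COUNTED &&
           choose_strategy(&speculative) == OPT_SPECULATIVELY_COUNTED &&
           choose_strategy(&data_dep)    == OPT_WHILE;
}
```

The only difference between the first two outcomes is whether the upper bound was proven loop invariant, which reflects the definition of a speculatively counted loop given earlier.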

In the foregoing specification, the invention has been
described with reference to specific exemplary embodiments
thereof. For purposes of explanation, specific numbers,
systems and configurations were set forth in order to provide
a thorough understanding of the present invention. It will,
however, be evident that various modifications and changes
may be made thereto without departing from the broader
spirit and scope of the invention as set forth in the appended
claims. The specification and drawings are, accordingly, to
be regarded in an illustrative rather than a restrictive sense.




What is claimed is: 

1. A method of constructing an unrolled loop comprising: 
identifying a speculatively counted loop, wherein said 

speculatively counted loop includes a loop upper bound 
that has not been proven to be loop invariant; 

locating a memory load instruction within loop body of 
said speculatively counted loop; 

inserting an advance load instruction into a preheader of 
said speculatively counted loop; 

replacing said memory load with an advanced load check
instruction;

unrolling said loop body of said speculatively counted
loop; and

generating a cleanup block for said speculatively counted 
loop. 

2. The method of claim 1 further comprising converting a 
while loop into a speculatively counted loop. 

3. The method of claim 1 wherein said loop body is 
unrolled by a predetermined unrolling factor. 

4. The method of claim 1 further comprising moving 
instructions located within said loop from a first location 
after a memory store instruction to a second location before 
said memory store instruction in said loop. 

5. The method of claim 1 wherein said cleanup block 
comprises a rolled copy of original loop body. 

6. The method of claim 1 further comprising removing 
termination tests from between unrolled loop bodies. 

7. The method of claim 1 further comprising generating a 
recovery block for said loop. 

8. The method of claim 1 wherein said cleanup block is a 
recovery block. 

9. A method of optimizing program performance com- 
prising: 

identifying a loop, said loop having a memory load that 
cannot be disambiguated from a loop upper bound 
wherein said loop upper bound has not been proven to 
be loop invariant; 

locating a memory load instruction for said memory load 
within loop body of said loop; 

inserting an advance load instruction in preheader of said 
loop; 

replacing said memory load instruction with an advanced 
load check instruction; 
unrolling said loop body; and 
generating a cleanup block. 

10. The method of claim 9 wherein said loop is a 
speculatively counted loop. 

11. The method of claim 9 wherein said loop is a data 
dependent while loop. 

12. The method of claim 9 wherein said loop body is 
unrolled by a predetermined unrolling factor. 

13. The method of claim 9 further comprising converting 
a while loop into a speculatively counted loop. 

14. The method of claim 9 further comprising moving 
instructions located within said loop from a first location 
after a memory store instruction to a second location before 
said memory store instruction in said loop. 

15. The method of claim 9 wherein said cleanup block 
comprises a rolled copy of original loop body. 

16. The method of claim 9 further comprising removing 
termination tests from between unrolled loop bodies. 

17. The method of claim 9 further comprising generating 
a recovery block for said loop. 

18. The method of claim 9 wherein said cleanup block is 
a recovery block. 

19. A computer readable medium having embodied 
thereon a computer program, the computer program being 
executable by a machine to perform: 



01/29/2004, EAST Version: 1.4.1 




identifying a loop, wherein said loop includes a memory 
load that cannot be disambiguated from a loop upper 
bound; 

locating a memory load instruction within loop body of 
said loop; 

inserting an advance load instruction in preheader of said 
loop; 

replacing said memory load instruction with an advanced 
load check instruction; 
unrolling said loop body; and 
generating a cleanup block. 

20. The computer readable medium having embodied 
thereon a computer program in claim 19 wherein said loop 
is a speculatively counted loop. 

21. The computer program being executable by a machine 
in claim 19 to further perform moving instructions located 
within a loop from a first location after a memory store 
instruction to a second location before said memory store 
instruction in said loop. 

22. The computer readable medium having embodied 
thereon a computer program in claim 19 wherein said 
cleanup block comprises a rolled copy of original loop body. 

23. The computer program being executable by a machine 
in claim 19 to further perform removing termination tests 
from between unrolled loop bodies. 

24. The computer program being executable by a machine 
in claim 19 to further perform generating a recovery block 
for said loop. 

25. The computer readable medium having embodied 
thereon a computer program in claim 19 wherein said 
cleanup block is a recovery block. 




26. A digital processing system having a processor oper- 
able to perform: 
identifying a loop, said loop having a loop upper bound 
not proven to be loop invariant; 

locating a memory load instruction within loop body of 
said loop; 

inserting an advance load instruction in preheader of said 
loop; 

replacing said memory load instruction with an advanced 
load check instruction; 
unrolling said loop body; and 
generating a cleanup block. 

27. The digital processing system of claim 26 wherein 
said loop is a speculatively counted loop. 

28. The digital processing system of claim 26 to further 
perform moving instructions located within said loop from 
a first location after a memory store instruction to a second 
location before said memory store instruction in said loop. 

29. The digital processing system of claim 26 wherein 
said cleanup block comprises a rolled copy of original loop 
body. 

30. The digital processing system of claim 26 to further 
perform removing termination tests from between unrolled 
loop bodies. 

31. The digital processing system of claim 26 to further 
perform generating a recovery block for said loop. 

32. The digital processing system of claim 26 wherein 
said cleanup block is a recovery block. 

***** 






UNITED STATES PATENT AND TRADEMARK OFFICE 

CERTIFICATE OF CORRECTION 



PATENT NO. : 6,539,541 Bl Page 1 of 1 

DATED : March 25, 2003 

INVENTOR(S) : Geva 



It is certified that error appears in the above-identified patent and that said Letters Patent is 
hereby corrected as shown below: 



Column 2, 

Between lines 65 and 66, insert --/ = /+ 3 --. 
Column 10, 

Line 5, before "optimize", delete "is". 
Line 67, before "factor", delete "to". 

Column 15, 

Line 27, delete "R3 load y", insert -- R3 = load y --. 



Signed and Sealed this 
Tenth Day of June, 2003 




JAMES E. ROGAN 
Director of the United States Patent and Trademark Office 






United States Patent [19] 

Santhanam 

US005704053A 

[11] Patent Number: 5,704,053 
[45] Date of Patent: Dec. 30, 1997 



[54] EFFICIENT EXPLICIT DATA PREFETCHING 
ANALYSIS AND CODE GENERATION IN A 
LOW-LEVEL OPTIMIZER FOR INSERTING 
PREFETCH INSTRUCTIONS INTO LOOPS 
OF APPLICATIONS 

[75] Inventor: Vatsa Santhanam, Campbell, Calif. 

[73] Assignee: Hewlett-Packard Company, Palo Alto, 
Calif. 

[21] Appl. No.: 443,653 
[22] Filed: May 18, 1995 

[51] Int. Cl.6 ... G06F 9/45 

[52] U.S. Cl. ... 395/383; 395/705 

[58] Field of Search ... 395/425, 375, 700, 383, 705 

[56] References Cited 

U.S. PATENT DOCUMENTS 

5,193,167 3/1993 Sites et al. ... 395/425 

5,214,766 5/1993 Liu ... 395/425 

5,377,336 12/1994 Eickemeyer et al. ... 395/375 

5,396,604 3/1995 DeLano et al. ... 395/375 

5,537,620 7/1996 Breternitz, Jr. ... 395/700 

OTHER PUBLICATIONS 

Chen, T-F, et al., A Performance Study Of Software & 
Hardware Data Prefetching Schemes, Apr. 1, 1994, Computer 
Architecture News, vol. 22, No. 2, pp. 223-232. 

Abraham, S G, et al., Predictability Of Load/Store Instruction 
Latencies, Dec. 1-3, 1993, Proceedings Of The Annual 
International Symposium On MicroArchitect. Austin, pp. 
139-152. 

Chi, C-H, et al., Compiler Driven Data Cache Prefetching 
for High Performance Computers, Proceedings of the 
Region 10 Annual International Conference, Tencon, 
Singapore, Aug. 22-26, 1994, vol. 2, No. Conf. 9, pp. 274-278. 

Callahan, et al., Software Prefetching, ACM Sigplan 
Notices, vol. 26, No. 4, Apr. 8, 1991, pp. 40-52. 

Mowry, T C, et al., Design and Evaluation of a Compiler 
Algorithm for Prefetching, ACM Sigplan Notices, vol. 27, 
No. 9, Sep. 1, 1992, pp. 62-73. 

Callahan, David, et al., "Software Prefetching", 1991, 
Association for Computing Machinery. 

Klaiber, Alexander C, et al., "An Architecture for 
Software-Controlled Data Prefetching", May 1991, Int'l Symposium 
on Computer Architecture. 

Chen, William Y, et al., "Data Access Microarchitectures for 
Superscalar Processors with Compiler-Assisted Data 
Prefetching", Proceedings of the 24th Int'l Symposium on 
Microarchitecture. 

Mowry, Todd C, et al., "Design and Evaluation of a Compiler 
Algorithm for Prefetching", 1992, Association for 
Computing Machinery. 

Johnson, Eric E., "Working Set Prefetching for Cache 
Memories", 

(List continued on next page.) 

Primary Examiner—Thomas C. Lee 
Assistant Examiner— David Ton 

[57] ABSTRACT 

A compiler that facilitates efficient insertion of explicit data 
prefetch instructions into loop structures within applications 
uses simple address expression analysis to determine data 
prefetching requirements. Analysis and explicit data cache 
prefetch instruction insertion are performed by the compiler 
in a machine-instruction level optimizer to provide access to 
more accurate expected loop iteration latency information. 
Such prefetch instruction insertion strategy tolerates 
worst-case alignment of user data structures relative to data cache 
lines. Execution profiles from previous runs of an application 
are exploited in the insertion of prefetch instructions 
into loops with internal control flow. Cache line reuse 
patterns across loop iterations are recognized to eliminate 
unnecessary prefetch instructions. The prefetch insertion 
algorithm is integrated with other low-level optimization 
phases, such as loop unrolling, register reassociation, and 
instruction scheduling. An alternative embodiment of the 
compiler limits the insertion of explicit prefetch instructions 
to those situations where the lower bound on the achievable 
loop iteration latency is unlikely to be increased as a result 
of the insertion. 

17 Claims, 9 Drawing Sheets 







5,704,053 

Page 2 



OTHER PUBLICATIONS 

Gornish, Edward H., et al., "Compiler-Directed Data 
Prefetching in Multiprocessors with Memory Hierarchies", 
1990, Association for Computing Machinery. 

Gupta, Anoop, et al., "Comparative Evaluation of Latency 
Reducing and Tolerating Techniques", 1991, Association for 
Computing Machinery. 

Fu, John W. C., et al., "Data Prefetching in Multiprocessor 
Vector Cache Memories", 1991, Association for Computing 
Machinery. 

Chen, Tien-Fu, et al., "Reducing Memory Latency via 
Non-blocking and Prefetching Caches", 1992, Association 
for Computing Machinery. 






U.S. Patent    Dec. 30, 1997    Sheet 1 of 9    5,704,053 

[Fig. 1 (PRIOR ART): block diagram; legible labels: CACHE, SYSTEM BUS, MEMORY; reference numerals 10, 12, 13, 14, 15] 



U.S. Patent Dec. 30, 1997 Sheet 2 of 9 5,704,053 







U.S. Patent    Dec. 30, 1997    Sheet 3 of 9    5,704,053 

[Fig. 3 (PRIOR ART): loop of 10 cycles per iteration, executed 100 times (x100); array element size = 8 bytes] 

i = 0 
top: 
...A[i]... 
i = i+1 
if (i<100) 
go to top 

[Fig. 4 (PRIOR ART): direct mapped data cache with lines 0 to N; array elements A[0] through A[7] and onward, 8 bytes each; cache line size = 32 bytes] 



U.S. Patent    Dec. 30, 1997    Sheet 4 of 9    5,704,053 

[Fig. 5 (PRIOR ART): loop including a prefetch instruction; 11 cycles per iteration, x100] 

i = 0 
top: 
...A[i]... 
...prefetch A[i+4] 
i = i+1 
if (i<100) 
go to top 

[Fig. 6 (PRIOR ART): loop unrolled four times; 37 cycles per iteration, 25x] 

i = 0 
top: 
A[i] 
A[i+1] 
A[i+2] 
A[i+3] 
i = i+4 
if (i<100) 
go to top 



U.S. Patent Dec. 30, 1997 Sheet 5 of 9 5,704,053 



i = 0 



38 cycles 



A[i] 
A[i+1] 
A[i+2] 
A[i+3] 

prefetch A[i+8] 

i = i+4 

if (i<100) 
go to top 



Fig. 7 

(PRIOR ART) 






U.S. Patent    Dec. 30, 1997    Sheet 6 of 9    5,704,053 

[Fig. 8: block diagram; legible labels: LLO 24; instruction scheduling + register allocation; LLIR 120a] 



U.S. Patent    Dec. 30, 1997    Sheet 7 of 9    5,704,053 



U.S. Patent    Dec. 30, 1997    Sheet 8 of 9    5,704,053 

[Fig. 10: flow diagram; legible steps: identify region constants; identify simple basic induction variables; compute and linearize address expressions for affine memory references] 



U.S. Patent    Dec. 30, 1997    Sheet 9 of 9    5,704,053 

[Fig. 11: flow diagram; legible steps: for a small stride equivalence class, apply small stride cluster identifier to equivalence class; for a large stride equivalence class, apply large stride cluster identifier to equivalence class] 

[Fig. 12: flow diagram; legible steps: group address expressions with group spatial locality into clusters; merge clusters that share group temporal locality; reference numerals 202, 205, 206] 




EFFICIENT EXPLICIT DATA PREFETCHING 
ANALYSIS AND CODE GENERATION IN A 
LOW-LEVEL OPTIMIZER FOR INSERTING 
PREFETCH INSTRUCTIONS INTO LOOPS 
OF APPLICATIONS 

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention relates to techniques for reducing data 
cache overhead in a computer system. More particularly, the 
invention relates to compiler-related techniques that are 
useful for reducing data cache overhead. 

2. Description of the Prior Art 

Data cache misses (described in greater detail below) can 
account for a significant portion of an application program's 
execution time on modern processors. This is particularly 
true in the case of scientific applications that manipulate 
large data structures which run on high frequency processors 
having long memory latencies. With increasing mismatch 
between processor and memory, the high penalty of cache 
misses has become and continues to be a dominant performance 
limiter of microprocessors. Increasing the cache size 
is one way to reduce cache misses. However, because the 
size of many numerical applications is also growing rapidly 
from generation to generation, the first level cache may not 
always be large enough to capture critical working sets. 

Most modern computer systems employ such caches to 
bridge the gap between memory and processor speeds. 
However, despite high cache hit ratios, the cost of cache 
misses in high frequency processors can significantly 
degrade run-time performance. To illustrate this point, a 
plausible scenario has been suggested where a cache miss 
penalty is 100 processor cycles and a data reference occurs 
every four cycles. See, for example, Alexander C. Klaiber, 
Henry M. Levy, An Architecture for Software-Controlled 
Data Prefetching, Proceedings of the 18th Annual International 
Symposium on Computer Architecture, May 1991. 
Even assuming a cache hit ratio of 99%, the processor is 
stalled for memory 20% of the time. 

One way to ameliorate the high overhead of data cache 
misses is to overlap the fetching of data from memory to the 
data cache with other useful computations. Certain high- 
performance superscalar microprocessors are able to 
achieve some degree of overlap between data cache miss 
handling and processor computation automatically through 
out-of-order instruction execution, facilitated by instruction 
queues capable of holding renamed register results, in con- 
junction with a split-transaction memory bus (e.g. such 
microprocessors as the Silicon Graphics T5, Hewlett- 
Packard PA8000, and Sun Ultrasparc). However, the degree 
of overlap typically achieved is insufficient to fully cover an 
external data cache miss latency. 

Some of these microprocessors support explicit data 
prefetch instructions that may be used to reduce the high 
overhead of data cache misses more effectively. Such 
instructions are typically defined to initiate data cache miss 
handling without holding up instruction execution until the 
referenced data is retrieved from memory. 

By inserting explicit data cache prefetch instructions into 
the code stream, a compiler can help ameliorate the high cost 
of data cache misses. However, this approach must be 
implemented judiciously because explicit cache prefetch 
instructions, in general, increase the dynamic path length of 
an application, and the added overhead may not be offset by 
a corresponding decrease in data cache miss overhead. 




There is much published literature on cache design trade-offs 
and hardware approaches to improving cache performance. 
Comparatively, however, there is much less literature 
on improving cache performance through software-controlled 
active cache management. The few papers that 
discuss software-controlled data prefetching to improve 
cache performance include Todd C. Mowry, Monica S. Lam, 
Anoop Gupta, Design and Evaluation of a Compiler Algorithm 
for Prefetching, Proceedings of the 5th International 
Conference on Architectural Support for Programming Languages 
and Operating Systems, October 1992; Alexander C. 
Klaiber, Henry M. Levy, An Architecture for Software-Controlled 
Data Prefetching, Proceedings of the 18th 
Annual International Symposium on Computer 
Architecture, May 1991 (an approach based on hand 
analysis); and Software Prefetching, David Callahan, Ken 
Kennedy, Allan Porterfield, Proceedings of the 4th International 
Conference on Architectural Support for Programming 
Languages and Operating Systems, April 1991 (where 
a prefetch instruction is added for each loop body memory 
reference without considering or exploiting cache line 
re-use, such that there is no selectivity; and where the 
prefetch insertion is performed at the source-code level, such 
that there is little integration with other compiler optimization 
phases; additionally, because the analysis is done at the 
source code level, it is difficult to estimate the prefetch 
iteration distance (PFID), i.e. the PFID used is always one 
loop iteration, which may be insufficient to hide the full 
cache miss latency). 

These papers concentrate on explicit prefetches for subscripted 
variables that are referenced in loops. They do not 
discuss insertion of explicit prefetch instructions into 
straight-line code for scalar or indirect memory references. 
Furthermore, it is generally assumed that the arrays of 
interest are all aligned on cache line boundaries. 

There are some general observations that are more or less 
common to the different studies of software-controlled data 
prefetching. One such observation is that data prefetching 
does not come for free. Specifically, explicit prefetches use 
up instruction issue bandwidth. In addition to the prefetch 
instruction itself, typically one or more instructions are 
needed to compute the address of the memory location to be 
prefetched. Recycling the computed prefetch address for the 
actual reference can involve tying up registers for extended 
lifetimes. The increased register pressure can result in the 
introduction of spill code in expensive loops. This can offset 
the expected performance gains due to prefetching. 

A simple prefetch strategy, such as the one proposed by 
David Callahan, Ken Kennedy, Allan Porterfield, Software 
Prefetching, Proceedings of the 4th International Conference 
on Architectural Support for Programming Languages and 
Operating Systems, April 1991, can wastefully increase the 
number of executed instructions through multiple prefetch 
requests for lines already in the data cache. 

Another important consideration cited by the different 
papers on software-controlled data prefetching is the actual 
placement of the prefetch instructions. If a prefetch is issued 
too close time-wise to the memory reference that needs to 
access the prefetched data, the prefetched data may not be 
available in time to avoid a CPU stall. On the other hand, if 
the prefetch is issued too early, there is a possibility of the 
prefetched line being displaced from the cache prematurely. 

Todd C. Mowry, Monica S. Lam, Anoop Gupta, Design 
and Evaluation of a Compiler Algorithm for Prefetching, 
Proceedings of the 5th International Conference on Architectural 
Support for Programming Languages and Operating 
Systems, October 1992, discuss the notion of identifying a 
prefetch predicate and the leading reference amongst mul- 
tiple references to an array to facilitate selective prefetching. 
This paper also discusses the interaction of data prefetching 
with other compiler transformations, specifically cache 
blocking and software pipelining. The prefetching algorithm 
disclosed is effective at reducing explicit data prefetch 
overhead. One shortcoming with this approach is that it 
relies on reuse and locality analysis that is rather complex. 
The analysis is done in the context of a high-level optimizer, 
which makes it difficult to estimate the prefetch iteration 
distance because the effects of downstream compiler com- 
ponents (e.g. code generator and low-level optimizer) on the 
loop body are unknown. It is also unclear how cache line 
alignment of prefetched data structures is accounted for 
when memory strides are greater than the cache line size. 
Also, it is unclear whether unnecessary prefetches are 
inserted for certain types of data reuse patterns. For instance, 
for the following C code fragment, the disclosed algorithm 
may actually insert three prefetches when two would be 
sufficient to ensure full [illegible]: 

int A[100][100]; 
for (i = 0; i < 100; i++) 
for (j = 0; j < 100; j++) 
{ 
...A[j][i]... 
...A[j+1][i-1]... 
} 

SUMMARY OF THE INVENTION 

Assume that a target processor supports a data cache line 
prefetch instruction with the following characteristics: 

It allows a memory address to be specified much like in 
an ordinary load or store instruction; 

If the memory referenced by the prefetch instruction is not 
found in the data cache, the processor causes the 
referenced memory location to be retrieved from lower 
levels of the memory hierarchy without stalling the 
execution of other instructions in the processor's 
execution pipelines; and 

The processor does not signal an exception even when the 
memory address specified by a prefetch instruction is 
invalid. 

The current invention provides a new compiler for such a 
processor that facilitates efficient insertion of explicit data 
prefetch instructions into loops within application programs. 
The compiler uses simple subscript expression analysis to 
determine data prefetching requirements. Analysis and 
explicit data cache prefetch instruction insertion are performed 
by the compiler in a machine instruction level 
optimizer to provide access to more accurate expected loop 
iteration latency information. 

Such a prefetch instruction insertion strategy tolerates 
worst case alignment of user data structures relative to data 
cache lines. Execution profiles from previous runs of an 
application are exploited in the insertion of prefetch instructions 
into loops with internal control flow. Cache line reuse 
patterns across loop iterations are recognized to eliminate 
unnecessary prefetch instructions. The prefetch insertion 
algorithm is integrated with other low level optimization 
phases, such as loop unrolling, register reassociation, and 
instruction scheduling. 

An alternative embodiment of the compiler limits the 
insertion of explicit prefetch instructions to those situations 
where the lower bound on the achievable loop iteration 
latency is unlikely to be increased as a result of the insertion. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block schematic diagram of a uniprocessor 
computer architecture including a processor cache; 

FIG. 2 is a block schematic diagram of a modern software 
compiler; 

FIG. 3 is a schematic representation of a loop; 

FIG. 4 is a schematic representation of a direct mapped 
data cache; 

FIG. 5 is a schematic representation of a loop, including 
a prefetch instruction; 

FIG. 6 is a schematic representation of a loop that has 
been unrolled four times; 

FIG. 7 is a schematic representation of an unrolled loop, 
including a prefetch instruction; 

FIG. 8 is a block diagram showing a low level optimizer, 
including a [illegible] according to the invention; 

FIG. 9 is a block diagram of a prefetch driver according 
to the invention; 

FIG. 10 is a block diagram of a loop body analysis module 
according to the invention; 

FIG. 11 is a block diagram of a module that is used to 
compute prefetch instruction needed for equivalence class 
according to the invention; and 

FIG. 12 is a block diagram of a module that applies a large 
stride cluster identifier to an equivalence class according to 
the invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 



The invention provides a new compiler that facilitates 
efficient insertion of explicit data prefetch instructions into 
loops within applications. FIG. 1 is a block schematic 
diagram of a uniprocessor computer architecture 10 including 
a processor cache. In the figure, a processor 11 includes 
a cache 12 which is in communication with a system bus 15. 
A system memory 13 and one or more I/O devices 14 are 
also in communication with the system bus. 

FIG. 2 is a block schematic diagram of a software 
compiler 20, for example as may be used in connection with 
the computer architecture 10 shown in FIG. 1. The compiler 
Front End component 21 reads a source code file (100) and 
translates it into a high level intermediate representation 
(110). A high level optimizer 22 optimizes the high level 
intermediate representation 110 into a more efficient form. A 
code generator 23 translates the optimized high level intermediate 
representation to a low level intermediate representation 
(120). The low level optimizer 24 converts the low 
level intermediate representation (120) into a more efficient 
(machine-executable) form. Finally, an object file generator 
25 writes out the optimized low-level intermediate representation 
into an object file (141). The object file (141) is 
processed along with other object files (140) by a linker 26 
to produce an executable file (150), which can be run on the 
computer 10. In the invention described herein, it is assumed 
that the executable file (150) can be instrumented by the 
compiler (20) and linker (26) so that when it is run on the 
computer 10, an execution profile (160) may be 
generated, which can then be used by the low level optimizer 
24 to better optimize the low-level intermediate representation 
(120). The compiler 20 is discussed in greater detail 
below. 



In contrast to previous approaches to the cache miss 
problem discussed above (see Todd C. Mowry, Tolerating 
Latency Through Software-Controlled Data Prefetching, 
PhD Thesis, Dept. of Electrical Engineering, Stanford 
University, March 1994; D. Callahan, K. Kennedy, A. 
Porterfield, Software Prefetching, Proceedings of the Fourth 
International Conference on Architectural Support for Programming 
Languages and Operating Systems, pp. 40-52, 
April 1991; and W. Y. Chen, S. A. Mahlke, P. P. Chang, W. 
W. Hwu, Data access microarchitectures for superscalar 
processors with compiler-assisted data prefetching, Proceedings 
of Microcomputing 24, 1991) the new compiler has the 
following unique attributes: 

Simple subscript expression analysis is used to determine 
data prefetching requirements, as opposed to sophisticated 
reuse/dependence analysis. 

Subscript expression analysis and explicit data cache 
prefetch instruction insertion are performed by the 
compiler in a low-level, i.e. machine instruction level, 
optimizer. A principal advantage of this approach is 
access to more accurate expected loop iteration latency 
information. 

The prefetch instruction insertion strategy tolerates worst 
case alignment of user data structures relative to data 
cache lines. 

Execution profiles from previous runs of an application 
are exploited in the insertion of prefetch instructions 
into loops with internal control flow. 

Cache line reuse patterns across loop iterations are recognized 
to eliminate unnecessary prefetch instructions. 

The prefetch insertion algorithm is integrated with other 
low level optimization phases, such as loop unrolling, 
register reassociation, and instruction scheduling. 

An alternative embodiment of the new compiler also 
limits the insertion of explicit prefetch instructions to those 
situations where the lower bound on the achievable loop 
iteration latency is unlikely to be increased as a result of the 
insertion. 

The new compiler yields significant performance 
improvements for some industry-standard performance 
benchmarks on simulations of the Hewlett-Packard Company 
(Palo Alto, Calif.) PA-8000 processor. 

The following discussion explains compiler operation in 
the context of a loop within an application program. Loops 
are readily recognized as a sequence of code that is iteratively 
executed some number of times. The sequence of such 
operations is predictable because the same set of operations 
is repeated for each iteration of the loop. It is common 
practice in an application program to maintain an index 
variable for each loop that is provided with an initial value, 
and that is incremented by a constant amount for each loop 
iteration until the index variable reaches a final value. The 
index variable is often used to address elements of arrays 
that correspond to a regular sequence of memory locations. 
Such array references by a loop constitute a significant 
portion of cache misses in scientific applications. 

In the compiler, it has been found that the low level 
optimizer component of a compiler is in a good position to 
deduce the number of cycles required by a stretch of code 
that is repetitively executed. As discussed above, the concept 
of prefetching is not new. 

Nonetheless it is helpful to explain prefetching at this 
point. For example, assume that the time that it takes to get 
a data item back from main memory to cache is 100 cycles, 
during which time the processor must wait idly before it can 
operate on the data. To avoid wasting idle processor cycles 
on account of data cache misses, it is desirable to initiate 
retrieval of data that is not likely to be found in the cache, 
in advance of such data being needed by the processor. The 
compiler can predict which data is needed in advance for 
loops that access array elements in a regular fashion. The 
compiler can then insert prefetch instructions into loops such 
that array elements that are likely to be needed in future loop 
iterations are retrieved from memory ahead of time. Ideally, 
the array elements are prefetched early enough that, by the 
time an element is actually required by the processor, the 
array element is retrieved from memory and placed in the 
data cache (if it was not there to begin with). 

In prior art approaches to prefetching, cache alignment is 
a problem. Another known problem is the overhead of the 
prefetch instruction itself. These are very important problems. 
Run time array dimensioning is yet another problem 
[illegible]. 

For example, in FIG. 3, a loop is shown that takes 10 
cycles per iteration and iterates 100 times, accessing an 
8-byte array element on each iteration. If there are no cache 
misses, the loop takes 1000 cycles. In FIG. 4, a direct 
mapped data cache is shown where the cache line size is 32 
bytes, each line capable of holding 4 contiguous 8-byte array 
elements. For the loop of FIG. 3, it is assumed that a cache 
miss occurs on every fourth iteration (on every cache line 
crossing), which means that 25 data cache misses will occur 
for the whole loop. If it takes 40 cycles to service each cache 
miss, the total loop execution time becomes 2000 cycles, 
i.e. 1000 cycles for just executing the loop instructions 
+ 25x40 cycles, or another 1000 cycles, for the cache misses. 

If known prefetch techniques are used, then for the 
example of FIGS. 3 and 4 cache misses can be covered if a 
prefetch distance of four is chosen. In FIG. 5, a prefetch 
instruction is shown inserted into the loop of FIG. 3. As can 
be seen, the use of a prefetch instruction can eliminate most 
cache misses, thereby saving significant execution time. 
However, a prefetch instruction requires execution time. In 
the example herein, each iteration of the loop requires a 
prefetch instruction, which can be assumed to take an extra 
cycle. Therefore, for a loop that iterates 100 times, 100 
cycles must be added to the execution time to account for 
prefetching. 

Additionally, the first iteration of the loop incurs a cache 
miss, which in the example herein requires 40 cycles. 
Accordingly, prefetching avoids most cache misses, such 
that execution time is reduced to 1140 cycles, i.e. 1000 
cycles to execute the original loop instructions + 100 cycles 
for the prefetch instructions + 40 cycles for the initial cache 
miss before the first prefetch instruction is executed. 
Thereafter, the prefetch instructions overlap the 40-cycle 
data cache miss service time with the execution of four 
(11-cycle) loop iterations. 

Unfortunately, each time through the loop a new prefetch 
instruction is executed. Where the unit of transfer between 
the main memory and the cache is a cache line, some of the 
prefetches are redundant because a prefetch for a particular 
array location may refer to the same cache line as the 
prefetch for subsequent array locations. This redundancy 
occurs because there are adjacent array locations in the same 
cache line, and the system is issuing a redundant instruction 
to the memory system to retrieve the same cache line 
multiple times. Typically, computer systems that support this 
type of prefetch instruction track the instructions to determine 
if a requested address to prefetch a cache line matches 
a later prefetch to the same cache line. In such event, the 
second prefetch request to main memory is dropped. 




However, even though redundant prefetches typically get 
dropped, it is nonetheless important that prefetch instruc- 
tions that refer to the same cache line are not executed 
multiple times because the prefetch instruction itself takes 
up some compute time. The processor must fetch and 
execute the prefetch instruction, understand what data 
address the instruction refers to, and then access the data 
cache to check if the data are already cache-resident.

Note that the compiler is responsible for inserting prefetch
instructions into a loop body that specify the memory
address of data items that will be accessed in the future. The
memory address is determined based on the number of loop
iterations in advance (i.e. the prefetch iteration distance or
PFID) that data items need to be prefetched to fully hide the
time required to service potential data cache misses. The
PFID is determined taking into account the nature of the
loop body instructions and characteristics of the target
processor and memory system. For instance, for a "short"
loop, e.g. one that takes only two cycles per iteration to
execute, the PFID would need to be 50 in order to accommodate
a 100-cycle data cache miss latency.
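The relationship between miss latency, iteration time and PFID can be illustrated with a minimal sketch. The function name and the use of integer ceiling division are assumptions made for illustration; only the two numeric examples (a 2-cycle loop with a 100-cycle miss, and an 11-cycle loop with a 40-cycle miss) come from the text.

```python
def prefetch_iteration_distance(miss_latency_cycles, iteration_cycles):
    """Number of iterations ahead to prefetch so one miss is fully hidden.

    Hypothetical helper: rounds up the ratio of miss latency to the
    cycles spent in one loop iteration.
    """
    if iteration_cycles <= 0:
        raise ValueError("iteration_cycles must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (miss_latency_cycles + iteration_cycles - 1) // iteration_cycles


# "Short" loop from the text: 2 cycles per iteration, 100-cycle miss.
print(prefetch_iteration_distance(100, 2))
# 11-cycle iterations overlapping a 40-cycle miss.
print(prefetch_iteration_distance(40, 11))
```

Under these assumptions the first call yields 50 and the second yields 4, matching the figures quoted in the text.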

The key to efficient data prefetching then is to overlap the
computer's execution of the instructions in a piece of code,
such as a loop, with the time it takes to retrieve the data from
the memory and place it in the processor cache, and do this
in a way that avoids redundant prefetches.

Ideally, cache miss overhead is completely eliminated by
inserting prefetch instructions judiciously. Referring back to
the example above where the loop executes 100 iterations,
with each iteration taking 11 cycles (10 cycles for the
original loop body instructions + 1 cycle for the prefetch
instruction), plus 40 cycles for an initial cache miss before
prefetching starts, the time it takes to run the loop is only
1140 cycles, which is much better than the 2000 cycles of
the example above in FIG. 4.

However, 1140 cycles is still not quite as good as 1000
cycles. One way to increase further the savings in processor
execution time is by using a well known technique referred
to as loop unrolling. In loop unrolling, the body of the loop is
replicated. This reduces the number of times the loop is
executed by a factor that is equal to the number of
replications, although each time the code is executed there
are more instructions to execute. Thus, in loop unrolling
exactly the same amount of work is accomplished, but the
loop is now reorganized.

Note, however, that because the loop-closing branch
does not have to be replicated as many times as the loop
body is unrolled (in fact an unrolled loop typically needs just 
one loop-closing branch), loop unrolling can by itself result 
in improved performance. 

For example, FIG. 6 shows the loop of FIG. 3 after the
loop has been unrolled four times. Thus, instead of executing
the loop 100 times, the loop is executed 25 times. Assume
that a loop-closing branch takes 1 cycle to execute. Each
iteration in the unrolled loop would then require 37 cycles
(4 x 9 cycles + 1 cycle) and the total loop execution time is
equal to (25 iterations x 37 cycles) + (25 iterations x 40 cycles/
cache miss) = 1925 cycles.

In the context of the invention and per the example above,
if each iteration of the unrolled loop requires 37 cycles,
where the loop is unrolled four times, it is necessary to
prefetch data two iterations ahead (since 1 iteration ahead is
insufficient to accommodate a 40 cycle cache miss latency). If
the prefetch instruction is put at the bottom of the loop, then
the loop is executed before a prefetch is performed. This
does not provide optimum operation of the loop. Thus, the
placement of the prefetch instruction is critical. It is therefore
necessary to place the prefetch instruction at a point that
provides sufficient time for a prefetch before the loop
completes execution. For example, if the prefetch is placed
at the top of the loop, then the loop does the same amount
of work, but more effectively overlaps the time to service a
possible data cache miss for subsequent iterations with the
computation performed in the current iteration.

For the example above, where there are 100 iterations of
a 10-cycle loop that takes a total of 1000 cycles, the prefetch
instructions cost 100 cycles + a 40 cycle cache miss for the
first iteration. As a result, the execution of the loop is
reduced from 2000 cycles to 1140 cycles. By adding loop
unrolling, in this example where the loop is unrolled by a
factor of four (see FIGS. 6 and 7), each iteration of the loop
may take 38 cycles (37 cycles + 1 cycle for the prefetch
instruction). Thus, execution time for the loop is equal to 38
cycles x 25 iterations + 80 cycles for two cache misses before
prefetching begins = 1030 cycles. Thus, it is clear that the
techniques disclosed herein produce a substantial improvement
in the execution time for a loop.
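The four cycle counts quoted in this example (2000, 1140, 1925 and 1030) can be reproduced with a short arithmetic sketch. The constant and function names are hypothetical. The assumption that the loop without prefetching misses once every four original iterations is not stated outright here; it is inferred from the 25 misses charged to the unrolled loop.

```python
ITERATIONS = 100       # trip count of the original loop
BODY_CYCLES = 9        # loop body without the loop-closing branch
BRANCH_CYCLES = 1      # loop-closing branch
PREFETCH_CYCLES = 1    # assumed cost of one prefetch instruction
MISS_CYCLES = 40       # cache miss penalty in the example
UNROLL = 4             # unroll factor
ITERS_PER_MISS = 4     # assumption: one miss per four original iterations


def rolled_no_prefetch():
    per_iter = BODY_CYCLES + BRANCH_CYCLES
    misses = ITERATIONS // ITERS_PER_MISS
    return ITERATIONS * per_iter + misses * MISS_CYCLES


def rolled_with_prefetch():
    per_iter = BODY_CYCLES + BRANCH_CYCLES + PREFETCH_CYCLES
    # Only the first iteration misses before prefetching takes effect.
    return ITERATIONS * per_iter + 1 * MISS_CYCLES


def unrolled_no_prefetch():
    trips = ITERATIONS // UNROLL
    per_trip = UNROLL * BODY_CYCLES + BRANCH_CYCLES
    return trips * per_trip + trips * MISS_CYCLES


def unrolled_with_prefetch():
    trips = ITERATIONS // UNROLL
    per_trip = UNROLL * BODY_CYCLES + BRANCH_CYCLES + PREFETCH_CYCLES
    # Prefetching two iterations ahead leaves two initial misses exposed.
    return trips * per_trip + 2 * MISS_CYCLES


for fn in (rolled_no_prefetch, rolled_with_prefetch,
           unrolled_no_prefetch, unrolled_with_prefetch):
    print(fn.__name__, fn())
```

The model is deliberately simple: it ignores superscalar overlap, which the text notes can make the prefetch instruction free in some cases.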

Note that in some cases the prefetch instruction may not
cost any additional cycles to execute. This is because many
modern processors are superscalar, i.e. they can execute
multiple instructions in one cycle, e.g. a load with an add.
Prefetch instructions are similar to a load because they refer
to memory. Thus, if there are several adds in a loop, adding
one extra prefetch does not increase the time necessary to
execute each iteration of the loop because the prefetch
instruction is executed in parallel with the add instruction.

One important feature of the invention identifies loops
and access patterns to allow a determination of how many
cycles are devoted to loop iterations, and therefore allows
insertion of the prefetch instruction to a location of an array
that is sufficiently far in advance to make sure that the miss
time is minimized. One problem is that loops can be coded
in many different ways. It is therefore necessary to recognize
different types of loops. For example, there are some loops
that are not always handled by prefetching.

In the invention, the compiler translates the higher level
application into an instruction stream that the processor
executes, where the compiler inserts prefetches at opportune
points into the instruction stream that effect data retrieval
from main memory into the cache in advance of when that
data item is actually needed. The compiler anticipates that a
particular data item is going to be needed at a particular time,
rather than letting the processor execute blindly until it gets
to that point in time, where the processor stalls and waits for
several dozens of cycles until that item is fetched from main
memory.

One advantage in letting the processor continue executing
the other instructions which may have nothing to do with
memory is that the system can achieve an overlap. While the
access time between processor and cache is typically 1 to 5
cycles, the retrieval time from main memory to cache is often on
the order of 10 to 100 cycles. When the processor actually
gets to the point where the data item is needed, it is not
necessary to wait for a cache miss that takes 100 processor
cycles. Thus, instead of waiting 100 processor cycles, it may
only be necessary to wait for 20 cycles because 80 cycles
worth of look up time is hidden or overlapped with the
previous execution.

The invention is preferably implemented in the low level
optimizer of the compiler to insert prefetch instructions at
opportune points in the code. In particular, the invention
inserts prefetch instructions into loops. One advantage of
inserting prefetch instructions into loops is that the data
reference pattern of a loop tends to be regular and the
compiler is better able to predict the kind of memory items
that are likely to be required in the future, where the future
is not the current iteration of the loop, but five or six
iterations in the future. As discussed above, this depends on
the characteristics of the instructions found within the loop.
It is therefore necessary to vary the point in time that the
prefetch request is actually issued, based on the expected
latency of a loop iteration.

The compiler is the piece of software that translates
source code, such as C, BASIC, or FORTRAN, into a binary
image that actually runs on a machine. Typically the compiler
consists of multiple distinct phases, as discussed above
in connection with FIG. 2. One phase is referred to as the
front end, and is responsible for checking the syntactic
correctness of the source code. If the compiler is a C
compiler, it is necessary to make sure that the code is legal
C code. There is also a code generation phase, and the
interface between the front-end and the code generator is a
high level intermediate representation. The high level intermediate
representation is a more refined series of instructions
that need to be carried out. For instance, a loop might
be coded at the source level as:

for (I=0, I<10, I=I+1),

which might in fact be broken down into a series of steps,
e.g. each time through the loop first load up I and check it
against 10 to decide whether to execute the next iteration. A
code generator takes this high level intermediate representation
and transforms it into a low level intermediate representation.
This is much closer to the actual instructions that
the computer understands. In terms of improving the quality
of the intermediate representations, a low level intermediate
representation generated by a code generator is typically fed
into a low level optimizer.

An optimizer component of a compiler must preserve the
program semantics (i.e. the meaning of the instructions that
are translated from source code to a high level intermediate
representation, and thence to a low level intermediate
representation and ultimately an executable file) but may
rewrite or transform the code in a way that allows the
computer to execute an "equivalent" set of instructions in
less time.

Modern compilers are structured with a high level optimizer
(HLO) that typically operates on a high level inter-
mediate representation and substitutes in its place a more 
efficient high level intermediate representation of a particu- 
lar program that is typically shorter. For example, an HLO 
might eliminate redundant computations. 

With the low level optimizer (LLO), the over-arching 
objectives are largely the same as the HLO, except that the 
LLO operates on a representation of the program that is 
much closer to what the machine actually understands. 
Uniquely, the invention performs prefetch analysis and 
prefetch instruction generation in the context of a low level 
optimizer. At this level there are not any semantic 
annotations, but merely instructions, such as add, load, and 
store. The compiler herein identifies repetitive code 
segments, such as loops, for prefetch instruction generation 
in the context of a low level optimizer. 

The analysis that the compiler herein uses is simpler than
that of the prior art. Additionally, because the invention
operates in the context of a low level optimizer on raw 
instructions, it is much easier to estimate how many pro- 
cessor cycles a loop iteration requires, and therefore how 
many iterations in advance a data prefetch instruction should 
be inserted into the code. 

There are many different organizations that are possible
for a data cache, but one possible organization that is not all
that uncommon is to organize it in terms of a series of
direct-mapped cache lines. Each cache line might be able to
hold up to 32 bytes of data, such that the unit of transfer
between main memory and the cache is in chunks of 32
bytes. The processor can make a request to the memory
system, and thence data is placed into a well defined location
in the cache to allow the processor to retrieve the data from
that location. If the cache in this example is 32,000 bytes in
size, then it has 1000 lines, each line holding 32 bytes. Any
explicit data prefetching compiler ultimately has to insert the
hardware prefetch instructions into the low level code
representation. One distinguishing feature of the current invention
is that the analysis required to insert prefetch instructions
efficiently is also done in the context of a low level
optimizer. Moreover, the prefetch instruction insertion is
done in a manner that is synergistic with other low level
optimizations, such as loop unrolling, register reassociation,
and instruction scheduling.
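The direct-mapped organization described above (32-byte lines, 1000 lines) can be sketched as follows. The modulo placement rule and the function names are assumptions for illustration; the text only states that data is placed into a well defined location in the cache.

```python
LINE_SIZE = 32     # bytes per cache line (from the example)
NUM_LINES = 1000   # 32,000-byte cache divided into 32-byte lines


def cache_line_index(address):
    """Cache slot for a byte address, assuming simple modulo placement."""
    return (address // LINE_SIZE) % NUM_LINES


def same_memory_line(addr_a, addr_b):
    """True when both addresses fall in the same 32-byte memory line."""
    return addr_a // LINE_SIZE == addr_b // LINE_SIZE


# Eight consecutive 4-byte array elements share one line, so a
# prefetch for each of them would be redundant after the first.
print([same_memory_line(0, 4 * k) for k in range(9)])

# Under the assumed placement, addresses 32,000 bytes apart collide
# in the same slot and can displace each other.
print(cache_line_index(64), cache_line_index(64 + 32000))
```

This is why prefetching every array element is wasteful for small strides, and why data prefetched too early can be evicted by a colliding reference before it is used.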

The invention herein resides within the domain of the low 
level optimizer 24. FIG. 8 is a block diagram showing a low 
level optimizer for a compiler, including a prefetch driver 34 
according to the invention. 

The low level optimizer 24 in accordance with the pre- 
ferred embodiment of the invention may include any com- 
bination of known optimization techniques, such as those 
that provide for local optimization 35, global optimization 
36, loop identification 37, loop invariant code motion 38, 
loop unrolling 30, register reassociation 31, and instruction 
scheduling 32. The invention provides a prefetch driver 34 
that operates in concert with such known techniques. 

The following pertains to the various elements of the low 
level optimizer shown on FIG. 8. 
Local optimizations include code improving transforma- 
tions that are applied on a basic block by basic block 
basis. For purposes of the discussion herein, a basic 
block corresponds to the longest contiguous sequence 
of machine instructions without any incoming or out- 
going control transfers, excluding function calls. 
Examples of local optimizations include local common 
subexpression elimination (CSE), local redundant load 
elimination, and peephole optimization. 
Global optimizations include code improving transforma- 
tions that are applied based on analysis that spans 
across basic block boundaries. Examples include global 
common sub-expression elimination, dead code 
elimination, and register promotion that replaces loads 
and stores with register references.
Loop identification is the process of identifying sections 
of code that get executed repetitively (typically this is 
done through interval analysis). 
Loop invariant code motion is the identification of 
instructions located within a loop that compute the same
result on every loop iteration and the re-positioning of 
such instructions outside the loop body. 
Register allocation and instruction scheduling is the pro- 
cess of assigning hardware registers to symbolic 
instruction operands and the re-ordering of instructions 
to minimize run-time pipeline stalls. 
FIG. 9 is a block diagram of a prefetch driver according 
to the invention. In the figure, code within the low level 
optimizer (numeric designator 24 on FIG. 8) is supplied to 
the prefetch driver 34. This is shown in the context of the 
low level optimizer of FIG. 8 by points A and B on the 
figure, which points are also shown on FIG. 9 to reference 
the prefetch driver 34 between both of FIGS. 8 and 9. Loop
unrolling may be incorporated into the invention and is
shown on FIG. 9 as a module that unrolls the loop a selected
number of times 92. Loop body analysis 91 is discussed in
greater detail below in connection with FIG. 10.

The prefetch driver estimates the prefetch distance 93 and
then partitions the memory references occurring in each loop
into disjoint equivalence classes 94 based on the symbolic
address expression. Address expressions are sorted within
each equivalence class 95 and prefetch instructions that are
necessary for each equivalence class are calculated 96.
Finally, the prefetch instructions are generated 97. A stream
of code produced by the prefetch driver includes both the
low level intermediate representation and prefetches that
have been inserted into the intermediate representation in
accordance with the invention herein.

The following detailed description pertains to the various
modules shown on FIG. 9:

1. Loop body analysis (see FIG. 10, on which the letters
G and H correspond to the same letters on FIG. 9):

a. Identify region constants 190. These are pseudo-registers
(symbolic [illegible]) that are only used and not defined in
the loop body. For the purposes of prefetching analysis, only
integer region constants are of interest.

b. Identify simple basic loop induction variables 191. A
simple basic induction variable (BIV) is a pseudo-register
whose loop body definitions can all be expressed in the
form:

BIV = BIV + biv_delta

where "biv_delta" is any arithmetic expression involving
only pure or region constants. In the example below, the
variable "k" is a region constant, and the variable "i" is a
BIV, with a single loop body definition whose "biv_delta"
term corresponds to (2*k).

k = . . .
loop
i = i + (2 * k)
end_loop

The net loop increment for a BIV is the total amount by
which the BIV is incremented on every loop iteration.
A BIV is said to have a well-defined loop increment if the
total amount by which the BIV is incremented is the same
on every loop iteration. The BIV loop increment in this case
is simply the sum of the "biv_delta" values associated with
each of its loop body definitions.

Note that a BIV with conditional loop body definitions
does not have a well-defined loop increment.

c. Compute and linearize address expressions for memory
references 192. This involves first identifying the address
expression associated with memory references. Typically
this is done by recursively tracing back the reaching
definitions for the register operands of base-relative and indexed
loads and stores that appear in the loop body, and constructing
a binary address expression tree, where the internal tree
nodes represent simple arithmetic operations (+,-,*) and the
leaf nodes represent either a pure constant, a region constant,
or a BIV. The traceback terminates unsuccessfully when a
non-BIV register operand has multiple reaching definitions
or when the address expression can not be expressed as a
simple binary expression tree for any other reason. Memory
references whose memory address can not be expressed as
such a binary expression tree are not further considered for
prefetching purposes.

The address expression tree is then linearized, if possible,
with respect to a unique BIV, meaning that it is re-written
into the form:

a_exp * BIV + b_exp

where "a_exp" and "b_exp" are themselves arithmetic
expressions involving just sum terms each of which is a
product involving either literal or region integer constants.
The "BIV" term refers to the value of the basic induction
variable at the top of the loop entry basic block (the basic
block that is the target of the branch representing the back
edge of the loop).

An address expression that can be linearized in this
manner is considered to be "affine". The "a_exp" term of an
affine address expression multiplied by the BIV's net loop
increment is also referred to as the memory stride. Also,
associated with each such memory reference is a memory
data size that can be inferred from the memory reference
opcode, e.g. a full-word load would be considered to have a
data size of 4 bytes.

Memory references with affine address expressions
involving a BIV with a well-defined net loop increment that
is [illegible] and whose "a_exp" term is [illegible] non-zero
are the only memory references that are further
analyzed for data prefetching purposes.

In the example below containing indexed references to
4-byte integer arrays A, B, C, and D,

loop
. . . A[i + 4] . . .
. . . B[2*i - 2*k + 8] . . .
[lines referencing C and D illegible in source]
end_loop

the variable "i" is a BIV and the address expressions
associated with the references to A, B, and D would be
considered affine, since their memory addresses can be
expressed as:

(4) * i + (16 + &A[0])
(8) * i + (32 - 8*k + &B[0])
(4) * i + (&D[0])

respectively, where the notation "&X[0]" refers to the
region constant variable that represents the address of the
start of X.

Note however that the address expression associated with
the reference to C is not affine.

2. Unroll the loop if possible

a. Compute maximum prefetch unroll factor U. The
objective here is to determine the largest unroll factor U that
can be used to minimize prefetch instruction overhead
without causing memory strides that are less than or equal to
the data cache line size to exceed the data cache line size.

The maximum prefetch unroll factor, U is computed as
follows:

U = loop unroll factor computed by other criteria (e.g. loop
size, expected trip count, trip count divisibility etc.)

For each affine address expression reference associated
with a BIV with a well-defined constant net loop increment
"net_loop_delta" do

{
    memory_stride = a_exp * net_loop_delta
    if (memory_stride <= cache_line_size)
    {
        u = cache_line_size/(ABS(memory_stride))
        U = minimum (U, u)
    }
}



b. If U>1, then unroll loop body U times and repeat loop 
analysis (steps a-c) on unrolled loop body. 
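Steps 1c and 2a can be illustrated with a minimal sketch that computes memory strides for affine references and the resulting maximum prefetch unroll factor. The class and function names, the net loop increment of 1, the starting factor of 8 and the use of integer division are all assumptions for illustration; the a_exp values 4, 8 and 4 are those of the references to A, B and D in the example, and the 32-byte line size comes from the cache example.

```python
from dataclasses import dataclass

CACHE_LINE_SIZE = 32  # bytes, as in the direct-mapped cache example


@dataclass(frozen=True)
class AffineRef:
    """An address of the form a_exp * BIV + b_exp (b_exp omitted here)."""
    name: str
    a_exp: int  # byte multiplier of the BIV, assumed already constant


def memory_stride(ref, net_loop_delta):
    # Stride is the a_exp term multiplied by the BIV's net loop increment.
    return ref.a_exp * net_loop_delta


def max_prefetch_unroll_factor(refs, net_loop_delta, initial_u):
    """Cap the unroll factor so small strides stay within one cache line."""
    u_max = initial_u
    for ref in refs:
        stride = abs(memory_stride(ref, net_loop_delta))
        if 0 < stride <= CACHE_LINE_SIZE:
            # Integer division is an assumption; the text writes a plain "/".
            u_max = min(u_max, CACHE_LINE_SIZE // stride)
    return u_max


refs = [AffineRef("A", 4), AffineRef("B", 8), AffineRef("D", 4)]
print(max_prefetch_unroll_factor(refs, net_loop_delta=1, initial_u=8))
```

With these inputs the reference to B (stride 8) is the limiting one, so the factor drops from the hypothetical starting value of 8 to 4.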

3. Estimate the minimum required prefetch iteration dis- 
tance (PFID): 

The prefetch iteration distance is the number of loop 
iterations in advance that data should be prefetched to have 
the data be available in the cache when it is needed by the 
processor, assuming the data was not in the cache to begin 
with. The PFID is computed based on the expected cache
miss latency and the minimum resource-constrained latency
for each loop iteration as follows:

PFID = Ceiling(average miss latency / average loop iteration latency)

There are two competing constraints with regard to the 
PFID choice: 

First, the PFID should be sufficiently large to hide the 
expected average cache miss latency. 

Secondly, the PFID should not be so large that the 
prefetched data is displaced from the cache by an interven- 
ing colliding memory reference before it is actually refer- 
enced. 

It is difficult to determine the optimum average expected 
memory latency. While the best-case round-trip memory 
access latency on one system may be, for example about 50 
cycles, it is likely to be different on another system that uses 
a slower bus. Furthermore, bus contention and memory bank 
conflicts tend to increase the memory access latency. 

Nevertheless, the average miss latency is heuristically
estimated as the minimum number of processor cycles that
elapse between the time a request is sent by the processor to 
the data cache and the time the data is forwarded to the 
processor, assuming the data was not present in the cache. 

Estimating the average loop iteration latency is even 
harder to do, even for single-basic block loops. Until sched- 
uling and register allocation are performed, it is not possible 
to know for sure how many cycles a loop iteration is going 
to take. Because it is expensive and difficult to compute the 
achievable loop iteration latency precisely, a lower bound on 
the achievable loop iteration latency based on machine 
resource usage is computed instead. This is quite effective 
for superscalar processors that execute instructions out-of-order
and are able to overlap operation latencies at run-time.
Typically for such machines, instruction retirement bandwidth
constrains the execution cycle count the most. Thus,
by focusing on the retirement bandwidth requirements of the 
instructions present in the loop body, a lower bound on the 
achievable loop iteration latency can be computed. 

Certain instructions that are likely to be eventually deleted 
should be ignored in computing the loop iteration latency 
estimate. These may include register-to-register move 
instructions, subscript instructions that may be eliminated by
reassociation, and floating-point multiplies and adds that
may be fused into floating-point multiply-and-accumulate 
instructions. 

For instance, suppose a target out-of-order processor can 
retire two memory instructions and two ALU or floating- 
point operations per cycle and suppose the loop body code 
consists of: 

5 memory operations,
6 ALU operations and
7 floating-point operations

and that three of the ALU operations participate in addressing
expressions that are likely to be eliminated through
register reassociation. The lower-bound on the loop iteration
latency would then be 5 cycles computed as the larger of 5/2
and ((6-3)+7)/2.
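The retirement-bandwidth bound in this example can be sketched as below. The function and parameter names are hypothetical, and rounding the result up to a whole cycle is an assumption; the operation counts and the two-per-cycle retirement limits are those of the example.

```python
import math


def iteration_latency_lower_bound(mem_ops, alu_ops, fp_ops,
                                  alu_ops_eliminated=0,
                                  mem_per_cycle=2, alu_fp_per_cycle=2):
    """Lower bound on cycles per iteration from retirement bandwidth."""
    mem_cycles = mem_ops / mem_per_cycle
    # ALU operations expected to vanish through reassociation are ignored.
    compute_ops = (alu_ops - alu_ops_eliminated) + fp_ops
    compute_cycles = compute_ops / alu_fp_per_cycle
    return math.ceil(max(mem_cycles, compute_cycles))


# 5 memory, 6 ALU (3 likely eliminated) and 7 floating-point operations.
print(iteration_latency_lower_bound(5, 6, 7, alu_ops_eliminated=3))
```

Here the memory side needs 5/2 cycles and the ALU/floating-point side needs 10/2 cycles, so the bound is the larger of the two.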

Now, it is also necessary to address the issue of loops that 
have internal branches. The minimum loop iteration latency 
for such loops is estimated by using previously collected 
execution profile information, which indicates the execution 
count for each basic block in the loop body. The minimum 
cycle count for each basic block is computed based on the 
retirement constraints for the instruction mix within the 
basic block. 

The minimum cycle count is summed over each basic 
block that is executed more than half as many times as the 
loop entry node to yield an estimate for the minimum loop 
iteration latency. 
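For loops with internal branches, the profile-based estimate described above might be sketched as follows. The data layout (a list of per-block minimum cycle counts and execution counts) and all numbers are hypothetical; only the rule of summing blocks executed more than half as often as the loop entry node comes from the text.

```python
def branchy_loop_latency(blocks, entry_count):
    """Estimate the minimum iteration latency of a multi-block loop.

    blocks: iterable of (min_cycles, exec_count) pairs, one per basic block.
    entry_count: profiled execution count of the loop entry node.
    """
    return sum(min_cycles
               for min_cycles, exec_count in blocks
               if exec_count > entry_count / 2)


# Hypothetical profile: the 6-cycle block runs on only 20 of 100
# iterations, so it is left out of the estimate.
profile = [(3, 100), (4, 80), (6, 20), (2, 100)]
print(branchy_loop_latency(profile, entry_count=100))
```

Rarely executed blocks are excluded so that an infrequent slow path does not inflate the estimate and shrink the prefetch iteration distance.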
4. Identify equivalence classes: 
To decide the sort of explicit prefetch instructions to insert
into the loop body, uniformly generated equivalence classes
of memory references are first identified. These are basically
disjoint sets of memory references whose address expressions
are known to differ by a compile-time constant. This
is done to help clearly detect group spatial and group
temporal locality among the different memory references,
which in turn can help reduce the prefetch instruction
overhead.

Place each affine address expression associated with a
BIV with a compile-time constant net loop increment in a
distinct group such that all address expressions within a
group share the following properties:
they are all associated with the same BIV
they all have the same "a_exp" term
their "b_exp" terms differ by a compile-time constant
The following algorithm is used to do this:



- let the set of uniformly generated equivalence classes, UGEC = { } 

- add each amne address expression E(biv I a_exp,b_exp), to a work 
listW 

repeat 



30 



35 



{ 



45 



50 



55 



60 



65 



- remove an address expression Ei{bh\ a_exp, b_exp) from the work 
list, W 

• compute (be memory stride M for Ei as (a_exp * net biv loop 



:) 

- if M is not a compile -time constant then substitute some fixed 
constant, C far each non-compik-time constant pseudo-register P t 
that occurs in M and compute me constant-folded memory stride M" 

Also, replace with C occurrences in *b_exp" of any non-comptle- 
tone constant pseudo-register P that occurs ia M, and coastantrfbld 
"b_exp" to yield b_exp' 

* if M is a compile-time constant, then let M" = M and b_exp' = 
K_exp 

for each existing equivalence class Q in UGEC 



{ 



- choose any representative address expression 
Er{biv, a__exp, b_jnp') belonging to Q 

- if biv and a__exp of Er and Ei are not identical, move on to 
the next equivalence class 

- symbolically subtract the b_exp' expression for Ei from Er to 
obtain S 

* if S is a non-zero compile-time constant, add Ei to equivalence 
class Q and move on to consider next address expression on the 



01/29/2004, EAST Version: 1.4.1 



5,704,053 



15 

-continued 



16 

I B{M)xq_o&et-B(i).eq_offset I<=prefetch memory distance 



woik list W 

} 

- if Ei wis not added to any existing equivalence chss> then add a 
ivw cquivakox class X to UGEC ami add Ei to X. Also associate M 
with the newly created equivalence class X 

} until work list W is empty 

Consider each equivalence class, Q, in turn and do the 
following: 

5. Sort the address expressions within each equivalence 
class based on their b_exp' terms and replace multiple 
address expressions with identical b_exp' terms with a 
single representative address expression. 

Since by construction, the address expressions belonging 
to the same equivalence class differ in their b_.exp'terms by 
a simple constant* it should always be possible to sort them 
based on increasing b_exp' values. 

Let "EJow" be the address expression in Q with the 
lowest b„exp' value. Compute a relative equivalence class 
offset, "eq_offset" for each address expression E in Q, as: 

E xq_o2set=E.b_exp'-£LJw. b_exp* 

6. Compute prefetch instructions needed to ensure full 
cache miss coverage for equivalence class Q. 

The goal here is to insert the fewest number of prefetch 
instructions in the loop body to ensure that in the steady- 
state, a prefetch is issued for every distinct cache line 
referenced by the address expressions in the equivalence 
class Q. Unnecessary prefetches are avoided if possible by 
exploiting any group-spatial or group-temporal locality that 
may be apparent among the memory references within each 
equivalence class. 

The method of determining the fewest number of prefetch 
instructions needed to ensure full cache miss coverage 
depends on the magnitude of the memory stride, M\ asso- 
ciated with the equivalence class. 

If M 1 is <^cache line size, then a prefetch strategy suited 
to small strides is employed, otherwise a prefetch strategy 
suited to large strides is used. Note that for large strides, 
cache line alignment of data elements needs to be consid- 
ered. 

In either case, it is first necessary to identify clusters of 
references within the uniformly generated equivalence class. 
A cluster consists of one or more memory references mat 
occur consecutively in the equivalence class list sorted on 
w eq_offsef\ with a well-defined cluster leader. The cluster 
leader is used to generate prefetch data on behalf of all 
members of the cluster. The objective here is to weed out 
those ref s within an equivalence class mat trail other refs 
within the equivalence class. The refs that are still left 
standing are essentially cluster leaders. 

The manner in which memory references are grouped into 
clusters depends on the relative size of the memory stride as 
compared to the cache line size. 

a. Ouster identification for small stride equivalence 
classes. 

It is necessary to consider the address expressions in the
equivalence class in the order of increasing "eq_offset"
values and determine whether each address expression trails
the very next address expression in the equivalence class and,
if so, drop it from the equivalence class.

Let B(i) and B(i+1) be adjacent memory refs within the
sorted equivalence class list. When the memory stride is
<= cache line size, B(i) is considered to be in the same cluster
as B(i+1), and therefore omitted from prefetch consideration,
iff

ABS(B(i+1).eq_offset - B(i).eq_offset) <= prefetch memory distance

where the prefetch memory distance is computed as the
product of PFID and the effective memory stride, M', for the
equivalence class.

The logic behind this is that if B(i+1) leads a reference
B(i) by less than the prefetch memory distance, then there is
no real point in inserting a prefetch instruction on behalf of
B(i). While some of the initial PFID executions of B(i)
within the loop may suffer cache misses, subsequent executions
of B(i) would either find its data in the cache or have
to wait much less than a full cache miss latency for its data
to be retrieved from main memory, since B(i+1) or a prefetch
associated with B(i+1) would have initiated the memory
retrieval earlier in time. [This is of course assuming that
conflict/capacity misses haven't displaced the data from the
cache by the time B(i) catches up with B(i+1)].

The last PFID loop iterations can be peeled as described
in Mowry et al. to avoid the overhead of redundant prefetch
instructions that would be executed for data elements not
accessed by the original loop.

As shown on FIG. 9, the module 96 that computes the
prefetch instructions needed for an equivalence class
is identified by the letters C and D, which letters are
used to indicate a more detailed explanation of the module,
which is shown on FIG. 11. In the figure, the prefetch driver
is shown to comprise a module that computes the prefetches that
are needed 96, where small stride prefetch candidates 201
and large stride prefetch candidates 202 are identified in
accordance with a detection module 200. The algorithm for
cluster identification with small strides is given below:

for each address expression "p" in the current equivalence
class in sorted order, except the very last address expression
{
    - let q be the next address expression in the equivalence class
    - if (ABS(q.eq_offset - p.eq_offset) <= ABS(M' * PFID))
      {
          remove p from equivalence class list
      }
      else
      {
          - mark p as a leader of a cluster
          - let p.trailing_offset = p.leading_offset = p.eq_offset
      }
}
- mark the very last address expression p in the equivalence class list
  as a cluster leader
- let p.trailing_offset = p.leading_offset = p.eq_offset

The address expressions remaining in the equivalence
class after this weeding out process are all cluster leaders.
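The small-stride weeding procedure above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name is hypothetical, address expressions are reduced to a plain list of "eq_offset" byte values, and a positive stride is assumed so that ascending "eq_offset" is the sorted order.

```python
def small_stride_leaders(eq_offsets, stride, pfid):
    """Return the eq_offsets that survive as cluster leaders.

    eq_offsets: byte offsets of the references in one equivalence class.
    stride:     effective memory stride M' in bytes (assumed positive).
    pfid:       prefetch iteration distance.
    """
    offsets = sorted(eq_offsets)
    pf_mem_dist = abs(stride * pfid)          # prefetch memory distance
    leaders = []
    # Compare every expression p with the very next expression q.
    for p, q in zip(offsets, offsets[1:]):
        if abs(q - p) <= pf_mem_dist:
            continue                          # p trails q closely: drop p
        leaders.append(p)                     # p becomes a cluster leader
    if offsets:
        leaders.append(offsets[-1])           # last expression always leads
    return leaders


if __name__ == "__main__":
    # Stride 8 bytes, PFID 4: prefetch memory distance is 32 bytes.
    print(small_stride_leaders([0, 8, 16, 200], stride=8, pfid=4))
```

With these inputs the references at offsets 0 and 8 are dropped because each trails its successor by no more than 32 bytes, while the reference at offset 16 is kept because the next reference is 184 bytes ahead.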

b. Cluster identification for large stride equivalence 
classes. 

The algorithm for detecting the fewest number of prefetch
candidates needed for an equivalence class with a large
stride is unfortunately a bit more complicated than the one
used for small-stride equivalence classes. The primary
reason for this is that with large memory strides, the relative
cache line alignment of the memory refs becomes important.
For instance, consider the following "C" loop nest:



int A[100][100];
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
    {
        ... A[j][i] ...
        ... A[j][i+1] ...
    }



prefetch A[i+0+16] ;    p/f for cluster stragglers
prefetch A[i+3+16] ;    p/f for cluster stragglers
prefetch A[i+7+16] ;    p/f for cluster leader



The above source code fragment strides through the array
A in large increments for each iteration of the inner j-loop.
It must be determined whether it is sufficient to insert only
one prefetch instruction on behalf of A[j][i+1], with the
assumption that the A[j][i] reference is a trailing reference.
The answer is no because the two references could straddle
a cache line boundary. If this were the case, then the
references to A[j][i] could miss the cache, possibly on every
iteration of the j-loop, even though data is prefetched for the
A[j][i+1] reference.

However, this is not to say that there is no hope of sharing
prefetch instructions among references within a uniformly
generated equivalence class with a large stride. For instance,
if the first reference in the j-loop above had been to
A[j-1][i+1] instead of A[j][i], clearly one prefetch instruction
would be sufficient for both references.
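The alignment hazard can be made concrete with a small Python sketch. The figures used here are assumptions for illustration only: 4-byte int elements, a 32-byte cache line, row-major layout, and an array base aligned to a cache line. The helper names are hypothetical.

```python
ELEM_SIZE = 4    # assumed sizeof(int) in bytes
LINE_SIZE = 32   # assumed cache line size in bytes
COLS = 100       # second dimension of A[100][100]


def byte_offset(row, col):
    """Byte offset of A[row][col] from the (line-aligned) array base."""
    return (row * COLS + col) * ELEM_SIZE


def same_cache_line(off_a, off_b, line_size=LINE_SIZE):
    """True when both byte offsets fall in the same cache line."""
    return off_a // line_size == off_b // line_size


def straddling_rows(i, rows=100):
    """Rows j for which A[j][i] and A[j][i+1] lie in different lines."""
    return [j for j in range(rows)
            if not same_cache_line(byte_offset(j, i), byte_offset(j, i + 1))]


if __name__ == "__main__":
    print(len(straddling_rows(3)))   # how many j-iterations straddle a line
    # A[j-1][i+1] and A[j][i+1] differ by exactly one j-loop stride.
    print(byte_offset(5, 4) - byte_offset(4, 4))
```

Under these assumptions the row stride is 400 bytes, which is not a multiple of the line size, so for a fixed i the pair A[j][i], A[j][i+1] straddles a boundary on some rows and not on others. By contrast, A[j-1][i+1] trails A[j][i+1] by exactly one stride, which is why a single prefetch can serve both.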

To make the system immune from the vagaries of relative
cache line alignment of references within an equivalence
class, yet at the same time exploit obvious temporal locality
among the references, a two-pass strategy is used. This
strategy is shown in FIG. 12. In the first pass, it is necessary
to identify clusters of adjacent references within the current
equivalence class, that are sorted based on their eq_offsets
205. The distinguishing feature of each such cluster is that
the references within the cluster share group spatial locality
but no group temporal locality.

The leading reference within each such cluster is responsible
for prefetching data both for itself and every other
reference in the cluster. To accommodate bad breaks with
cache line alignments, a cluster leader may give rise to
multiple prefetch instructions, each spaced a cache line apart
from the next, until the entire span of the cluster is accounted
for. To better explain this, consider the following simple
loop, which is possibly the result of the loop unrolling step,
where "A" is a double-precision array variable, i.e. with 8-byte
elements:



Regardless of the cache line alignment of the cluster 
leader, these three prefetch instructions ensure that all the 
cluster members have memory transactions initiated for data 
that they reference two iterations in advance. To simplify the 
actual generation of these types of prefetch instructions, 
cluster stragglers are removed from the equivalence class 
right away, and the cluster span for the representative cluster 
leader is recorded. 

b.i. The high level algorithm for the first pass is shown
below:

let q = last address expression in the current equivalence
class
mark q as a cluster leader
let q.trailing_offset = q.leading_offset = q.eq_offset
for each address expression "p" in the current equivalence
class considered in backward sorted order, ignoring "q"
{
    - compute distance from current cluster leader, dist_l, as follows:
          dist_l = ABS(q.eq_offset - p.eq_offset)
    - compute distance from current cluster trailer, dist_t, as follows:
          dist_t = ABS(q.trailing_offset - p.eq_offset)
    - check the following two conditions to determine if p should be
      included in the cluster headed by q:
          a) dist_t <= cache_line_size
          b) dist_l < M'
      if both these conditions are met, then
          - let q.trailing_offset = p.eq_offset
          - remove p from the equivalence class
      else
          - let q = p and mark q as a cluster leader
          - let q.trailing_offset = q.leading_offset = q.eq_offset
}
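The first pass above can be sketched in Python as shown below. This is a minimal sketch, not the described compiler's implementation: the `Cluster` record, the function name, and the use of plain byte offsets in place of address expressions are hypothetical, and a positive stride is assumed.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    eq_offset: int        # eq_offset of the cluster leader
    leading_offset: int
    trailing_offset: int
    members: int = 1      # number of references folded into the cluster


def first_pass(eq_offsets, stride, line_size):
    """Group references of a large-stride class into clusters."""
    offsets = sorted(eq_offsets)
    if not offsets:
        return []
    last = offsets[-1]
    q = Cluster(last, last, last)             # last expression leads
    clusters = [q]
    for p in reversed(offsets[:-1]):          # backward sorted order
        dist_l = abs(q.eq_offset - p)         # distance from leader
        dist_t = abs(q.trailing_offset - p)   # distance from trailer
        if dist_t <= line_size and dist_l < abs(stride):
            q.trailing_offset = p             # p joins q's cluster
            q.members += 1
        else:
            q = Cluster(p, p, p)              # p starts a new cluster
            clusters.append(q)
    return clusters


if __name__ == "__main__":
    # Eight 8-byte references A[i] .. A[i+7], stride 64, 32-byte lines.
    for c in first_pass([8 * k for k in range(8)], stride=64, line_size=32):
        print(c)
```

For the eight-reference example the sketch yields a single cluster led by the reference at offset 56 and trailing down to offset 0.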



i_loop:
    A[i]   =
    A[i+1] =
    A[i+2] =
    A[i+3] =
    A[i+4] =
    A[i+5] =
    A[i+6] =
    A[i+7] =
    i = i + 8
end_i_loop;



First of all, because the loop BIV has a net loop
increment of eight, and the element size of "A" is 8 bytes,
this is a large stride equivalence class, assuming a 32-byte
cache line size: (8 x 8 bytes = 64 bytes) > 32 bytes.

All eight references to "A" are placed into the same
cluster because they exhibit group spatial locality, and no
group temporal locality. The cluster leader is the reference to
A[i+7], and the span of the cluster is 64 bytes (i.e.
&A[i+7] - &A[i]). If the prefetch memory distance was computed
earlier to be 128 bytes, i.e. corresponding to a prefetch
iteration distance of two, it is only necessary to insert three
prefetch instructions to account for the entire span of this
8-member cluster. These three prefetches essentially
prefetch the following array elements:



b.ii. Having identified cluster leaders in the first pass, in
the second pass, the algorithm attempts to exploit temporal
locality between clusters (see the module identified by
numeric designator 206 on FIG. 12). This pass is somewhat
similar to the algorithm used to identify prefetch candidates
for small-stride equivalence classes. Basically, if two adjacent
clusters are less than the prefetch memory distance
apart, as measured between the trailing cluster's leader and
the leading cluster's trailer, then the trailing cluster may be
removed from further prefetch consideration.

However, rather than simply forgetting about the trailing
cluster, it is necessary to merge the trailing cluster with the
leading cluster. The algorithm must increase the span of the
leading representative cluster because the merged cluster's
span is used later on to determine how many prefetch
instructions are actually needed. Note that the size of a
merged cluster cannot be allowed to exceed the effective
memory stride M' because otherwise a prefetch instruction is
needlessly inserted for the merged cluster's trailing reference.
Instead, a merged cluster's size is clamped to be no
larger than the effective memory stride.

Another important consideration in deciding to merge
clusters is whether the merge would be profitable. In general,
for a cluster C whose span is "C.s" bytes, "C.p" prefetch
instructions are inserted. "C.p" can be computed as:

C.p = [ceiling (C.s/L)] + 1

where "L" is the cache line size. If a cluster C has "C.n"
references within it, then unless (C.p <= C.n), there is no





If cluster C(i) is merged with cluster C(j), then the span of
the merged cluster C(j') is given by:



benefit from the locality present within the cluster. This
profitability criterion is indirectly applied both in the
construction of clusters in the first pass, as well as in the
merging of clusters in the second pass. The profitability
criterion as applied to the first pass suggests that no two
adjacent references within a cluster should be greater than a
cache line size apart because, otherwise, it would be better
to break the cluster into two sub-clusters. Furthermore, the
span of the cluster, as in the distance between the candidate
cluster trailer and the current cluster leader, must not become
larger than the effective memory stride M', as explained
before. A merged cluster's size is therefore clamped to be no
greater than the effective stride for this reason.

In the second pass, when deciding whether to merge
clusters, the number of prefetches that would be needed for
the merged cluster using the above formula is compared to
the sum of the prefetches that would be needed if the clusters
are not merged.

One subtlety to the cluster merging pass is that it may
sometimes seem unprofitable to merge a cluster C(i) with the
next leading cluster C(i+1), even though in a larger context
it would have been profitable to merge C(i) with C(i+2),
assuming that C(i+2) and C(i) are less than the prefetch
memory distance apart. Thus, non-adjacent clusters are
examined for merging purposes, and the best cluster to
merge with is selected.

If it is decided to merge a cluster C(i) with another cluster
C(j), the merged cluster leader's relative offset may become
larger than the relative offset of C(j)'s leading reference.
Thus, the span of a merged cluster may be larger than simply
the difference in the original relative offsets of the cluster
leader and cluster trailer. The algorithm used for the second,
i.e. cluster merging, pass which is used to exploit temporal
locality between clusters is explained below:

The algorithm scans the remaining address expressions in
the equivalence class, each of which is a cluster leader
identified in the first pass, linearly forwards and looks for
merging opportunities with the other leading clusters. In the
presentation below, it is assumed for simplicity that the
uniformly generated equivalence class has a positive stride.

C(j) causing the system to expend fewer prefetches
than if C(i) is projected ahead by one more iteration. This



C(j').s = MIN {M', C(j').leading_offset - C(j').trailing_offset}
{the MIN operation clamps the merged cluster size at M'}

where C(j').leading_offset =
    MAX {C(j).leading_offset, (C(i).leading_offset + (m * M'))}

and C(j').trailing_offset =
    MIN {C(j).trailing_offset, (C(i).trailing_offset + (m * M'))}



For the merger of cluster C(i) into cluster C(j) to be
profitable in terms of reducing the overall number of
prefetches required, the following is necessary:

which means that:

ceiling (C(j').s/L) <= (ceiling (C(i).s/L) + ceiling (C(j).s/L)).

Let the savings accruing from merging cluster C(i) with
C(j) into a combined cluster C(j') be defined as:

merge_savings (i,j,j') = (C(i).p + C(j).p) - (C(j').p)

The criterion "b" for cluster merging may now be articulated.
For a cluster C(i), among the leading clusters C(j) that
satisfy criterion "a", the system chooses to merge cluster
C(i) with one of those clusters, C(j), for which:

merge_savings (i,j,j') is maximized and non-negative.

Note that while the value of "m" computed as mentioned
above represents the minimum positive integral multiple of
the stride required to achieve an overlap of C(i) with C(j), it
is also necessary to check whether using (m-1) could yield
a smaller span, and hence fewer prefetch instructions for the
merged cluster. Although projecting the cluster C(i) ahead
by (m-1) iterations definitely does not cause it to overlap
with C(j), the projected C(i) leader may get very close to the



A necessary condition for a trailing cluster C(i) to be
eligible for being merged with a leading cluster C(j) is that

a) [C(j).trailing_offset - C(i).leading_offset] < prefetch memory distance



Otherwise, C(j) would be leading cluster C(i) by more
than is desired.
Now, let:

m = ceiling [(C(j).trailing_offset - C(i).leading_offset) / M']



where "trailing_offset" and "leading_offset" refer to the
eq_offset of the cluster trailer and leader respectively. Note
that the trailing cluster may be the result of a previous cluster
merger and so its "trailing_offset" may not correspond to a
memory reference that actually appears in the source code.

Clearly, it is required that 0 < m <= PFID.

It is also necessary to define the span of a cluster C(x) to
be:



can be especially true if the stride is much larger than the
individual cluster spans.
Thus, the algorithm scans the remaining address
expressions, which represent clusters identified in the first
pass, in sorted order and merges trailing clusters with a
selected leading cluster by projecting the trailing cluster
ahead by either m or (m-1) iterations and checking if both
criteria "a" and "b" apply. To project a trailing cluster C(i)
ahead m iterations to evaluate whether it should be merged
with a leading cluster C(j), the system computes the tentative
"leading_offset" and "trailing_offset" values for the
proposed merged cluster C(j'), thusly:



C(x).s = C(x).leading_offset - C(x).trailing_offset

As mentioned earlier, the number of prefetch instructions
needed for a distinct cluster C(x) is given by C(x).p:

C(x).p = [ceiling (C(x).s/L)] + 1

where L is the cache line size.
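As a quick check of this formula, the following Python sketch evaluates it with integer arithmetic. The function names are hypothetical, and the profitability helper simply restates the condition C.p <= C.n described in the text.

```python
def prefetch_count(span, line_size):
    """C.p = ceiling(C.s / L) + 1, computed without floating point."""
    return -(-span // line_size) + 1


def cluster_is_profitable(span, line_size, n_refs):
    """A cluster pays off only if it needs no more prefetches than refs."""
    return prefetch_count(span, line_size) <= n_refs


if __name__ == "__main__":
    # The eight-reference cluster: 64-byte span, 32-byte cache lines.
    print(prefetch_count(64, 32))
    print(cluster_is_profitable(64, 32, n_refs=8))
```

For the eight-member cluster discussed earlier (64-byte span, 32-byte lines) this gives three prefetches, matching the count stated in the text, and a cluster consisting of a single reference (span 0) needs exactly one.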



let C(j').leading_offset = C(i).leading_offset + (m * M')
let C(j').trailing_offset = C(i).trailing_offset + (m * M')
if (M' > 0) {
    C(j').leading_offset = MAX (C(j').leading_offset, C(j).leading_offset)
    C(j').trailing_offset = MIN (C(j').trailing_offset, C(j).trailing_offset)
} else {
    C(j').leading_offset = MIN (C(j').leading_offset, C(j).leading_offset)
    C(j').trailing_offset = MAX (C(j').trailing_offset, C(j).trailing_offset)
}
adjust C(j').trailing_offset if needed to ensure C(j').s does not exceed M'
as follows:






C(j').s = ABS (C(j').leading_offset - C(j').trailing_offset)
if (C(j').s > ABS(M'))
{
    C(j').trailing_offset = C(j').leading_offset - M'
    C(j').s = ABS(M')
}
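The projection, merge, and clamping steps can be sketched in Python as follows. This is an illustrative sketch only: clusters are reduced to hypothetical (leading_offset, trailing_offset) tuples, all function names are invented for the example, and a positive stride is assumed as in the presentation above.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def prefetch_count(span, line_size):
    """C.p = ceiling(C.s / L) + 1."""
    return ceil_div(span, line_size) + 1


def min_multiple(ci, cj, stride):
    """m = ceiling((C(j).trailing_offset - C(i).leading_offset) / M')."""
    return ceil_div(cj[1] - ci[0], stride)


def project_and_merge(ci, cj, m, stride):
    """Project C(i) ahead m iterations and merge it with C(j)."""
    lead = max(ci[0] + m * stride, cj[0])
    trail = min(ci[1] + m * stride, cj[1])
    span = abs(lead - trail)
    if span > abs(stride):            # clamp the merged span at M'
        trail = lead - stride
        span = abs(stride)
    return lead, trail, span


def merge_savings(ci, cj, m, stride, line_size):
    """(C(i).p + C(j).p) - C(j').p for the proposed merger."""
    _, _, span = project_and_merge(ci, cj, m, stride)
    p_i = prefetch_count(abs(ci[0] - ci[1]), line_size)
    p_j = prefetch_count(abs(cj[0] - cj[1]), line_size)
    return p_i + p_j - prefetch_count(span, line_size)


if __name__ == "__main__":
    ci, cj = (4, 4), (404, 404)       # two single-reference clusters
    m = min_multiple(ci, cj, stride=400)
    print(m, project_and_merge(ci, cj, m, 400),
          merge_savings(ci, cj, m, 400, 32))
```

In the usage example the two single-reference clusters are exactly one 400-byte stride apart, so projecting the trailing cluster ahead by one iteration makes it coincide with the leading cluster and one prefetch is saved.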



When a cluster C(i) is successfully merged with a cluster
C(j) into a cluster C(j'), C(i) is removed from the equivalence
class list.

7. Generate the prefetch instructions required for each
remaining cluster leader.

It is necessary to consider each cluster leader in turn, and
where "trailing_offset" is different than "leading_offset"
for any cluster leader, insert as many prefetches as needed to
cover the cluster's entire span, i.e. from "leading_offset"
down to "trailing_offset", each prefetch instruction address
spaced L bytes apart.

More specifically, if the memory reference corresponding
to a cluster leader is represented by the instruction:

load disp(Rb),Rt

where "disp" is a displacement value and Rb and Rt are
pseudo-registers corresponding to the base register and
target register of the load, then one or more prefetch
instructions are inserted into the code stream as follows:

prefetch_inst new_disp(Rb)

where "new_disp" is computed as disp + (M*PFID) + pf_disp,
where "pf_disp" represents the constant offset that
needs to be added to the memory address referenced by the
cluster leader, to form the base address from which it is
necessary to prefetch ahead by the prefetch memory distance.
The algorithm used to emit the prefetch instructions is
given below:



          else the "trailer") with new_disp = disp + (M*PFID) +
          (final_offset - C(i).eq_offset)
    }
}
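The prefetch emission algorithm of this step can be sketched in Python as shown below. This is a hedged illustration rather than the described compiler's code: the function name and its flat argument list are hypothetical, and instead of emitting instructions the sketch returns the list of "new_disp" values that would be used.

```python
def prefetch_displacements(disp, stride, pfid, leading_offset,
                           trailing_offset, eq_offset, line_size):
    """Return the new_disp value of each prefetch for one cluster.

    disp:      displacement of the cluster leader's memory reference.
    stride:    original memory stride M in bytes.
    pfid:      prefetch iteration distance.
    eq_offset: eq_offset of the cluster leader.
    """
    base = disp + stride * pfid
    if leading_offset == trailing_offset:
        return [base]                         # single-reference cluster
    if stride > 0:
        cur, final = leading_offset, trailing_offset
    else:
        cur, final = trailing_offset, leading_offset
    pf_disp = cur - eq_offset
    emitted = []
    while cur > final:                        # one prefetch per cache line
        emitted.append(base + pf_disp)
        cur -= line_size
        pf_disp -= line_size
    emitted.append(base + (final - eq_offset))  # final member of the cluster
    return emitted


if __name__ == "__main__":
    # Eight-reference cluster A[i]..A[i+7] of 8-byte elements; the leader
    # A[i+7] sits 56 bytes past &A[i]; stride 64 bytes, PFID 2, 32-byte lines.
    print(prefetch_displacements(56, 64, 2, 56, 0, 56, 32))
```

For the eight-reference example the sketch produces byte displacements 184, 152, and 128 relative to &A[i], which correspond to A[i+7+16], A[i+3+16], and A[i+0+16].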



Note that in computing new_disp, the original memory
stride value M, computed as (a_exp * net_biv_loop_increment),
is used and not the constant folded value M'.
This may require materializing a run-time region constant
expression in a register, outside the loop body, and inserting
an explicit add instruction within the loop body to form the
prefetch instruction address thusly:



Rm = (a_exp * net_biv_loop_increment) * PFID
loop
    Rx = Rm + Rb
    prefetch_inst new_disp'(Rx)
    load disp(Rb),Rt
end_loop



where new_disp' = disp + pf_disp. If the prefetch instruction
supports an addressing mode which causes the effective
memory address to be computed as the sum of two register
values, then the add operation may be omitted by folding in
the new_disp' value in Rm outside the loop body, yielding
Rm', and specifying the Rm' and Rb register operands of the
prefetch instruction directly, as shown below:



Rm' = ((a_exp * net_biv_loop_increment) * PFID) + new_disp'
loop
    prefetch_inst Rm'(Rb)
    load disp(Rb),Rt
end_loop



However, if "disp" itself is a run-time value, as opposed 
to a simple constant, then an explicit add operation is 
unavoidable: 



- let L = cache line size
for each remaining cluster C(i) in the equivalence class
{
    - let disp = displacement of memory reference instruction associated
      with the leader address expression for cluster C(i)
    - if (C(i).leading_offset = C(i).trailing_offset) then
      {
          pf_disp = 0
          - emit prefetch_inst with new_disp = disp + (M * PFID)
      }
      else {
          if (M > 0)
          {
              - let cur_offset = C(i).leading_offset
              - let final_offset = C(i).trailing_offset
          } else
          {
              - let cur_offset = C(i).trailing_offset
              - let final_offset = C(i).leading_offset
          }
          - let pf_disp = cur_offset - C(i).eq_offset
          while (cur_offset > final_offset) do
          {
              - emit prefetch_inst with new_disp = disp +
                (M*PFID) + pf_disp
              - let cur_offset = cur_offset - L
              - let pf_disp = pf_disp - L
          }
          - emit one final prefetch to account for the final member
            of the cluster (i.e. the "leader" for a negative memory stride,



Rm' = ((a_exp * net_biv_loop_increment) * PFID) + pf_disp
loop
    Rx = Rb + disp
    prefetch_inst Rm'(Rx)
    load disp(Rb),Rt
end_loop



and if the prefetch instruction does not support a
register+register addressing mode, then two add operations may be
needed:



Rm = (a_exp * net_biv_loop_increment) * PFID
loop
    Rx1 = Rb + disp
    Rx2 = Rx1 + Rm
    prefetch_inst pf_disp(Rx2)
    load disp(Rb),Rt
end_loop



Note however, that these new add instructions may be 
eliminated through register reassociation. In fact, the 
prefetch instruction(s) and the beneficiary memory reference 
instruction may be able to share the same base register 
through register reassociation, allowing the add instructions 
to be deleted: 





Rp = initialized to the address of the memory location referenced by the
     load on the first loop iteration
Rm = (a_exp * net_biv_loop_increment)
Rdelta = (Rm * PFID) + pf_disp
loop
    prefetch_inst Rdelta(Rp)
    load 0(Rp),Rt
    Rp = Rp + Rm
end_loop
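The effect of the reassociated loop can be checked with a small Python simulation. This is a sketch under assumed values: the function name, the starting address, and the concrete stride and PFID figures are hypothetical, and registers are modeled as plain integers.

```python
def simulate_reassociated_loop(first_addr, a_exp, increment, pfid,
                               pf_disp, iterations):
    """Trace (load address, prefetch address) pairs per loop iteration."""
    rp = first_addr                    # Rp: address loaded on iteration 0
    rm = a_exp * increment             # Rm: memory stride in bytes
    rdelta = rm * pfid + pf_disp       # Rdelta: computed once, outside loop
    trace = []
    for _ in range(iterations):
        prefetch_addr = rp + rdelta    # prefetch_inst Rdelta(Rp)
        load_addr = rp                 # load 0(Rp),Rt
        trace.append((load_addr, prefetch_addr))
        rp += rm                       # Rp = Rp + Rm
    return trace


if __name__ == "__main__":
    trace = simulate_reassociated_loop(1000, a_exp=8, increment=1,
                                       pfid=4, pf_disp=0, iterations=10)
    for load_addr, prefetch_addr in trace[:3]:
        print(load_addr, prefetch_addr)
```

With pf_disp equal to zero, the address prefetched on iteration k is the address loaded on iteration k + PFID, and the loop body contains only the single increment of Rp in addition to the load and the prefetch.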



Furthermore, if the target architecture supports an
auto-increment addressing mode (e.g. PA-RISC, IBM PowerPC),
then the increment of the new base register Rp may be
folded into the load instruction itself.

In terms of the code placement of the prefetch instruction
itself, to start with, the prefetch instruction(s) may be placed
adjacent to the beneficiary memory reference instruction.
Subsequently, the instruction scheduling phase may re-order
the prefetch instruction(s) as needed to improve performance.
In doing this, memory dependencies between the
prefetch instruction and other memory references in the loop
body may be ignored and, assuming the prefetch instruction
is guaranteed not to raise an exception, it may be freely
scheduled across basic blocks as well.

Although the invention is described herein with reference
to the preferred embodiment, one skilled in the art will
readily appreciate that other applications may be substituted
for those set forth herein without departing from the spirit
and scope of the present invention. Accordingly, the invention
should only be limited by the Claims included below.

I claim: 

1. A compiler, comprising: 

means in a low level optimizer for analyzing and efficiently
inserting explicit data prefetch instructions into
loops of applications;

subscript expression analysis means for determining data
prefetching requirements;

means for recognizing cache line reuse patterns across
loop iterations to eliminate unnecessary prefetch
instructions; and

means for limiting insertion of explicit prefetch instructions
to situations where a lower bound on an achievable
loop iteration latency is unlikely to be increased as
a result of said prefetch instruction insertion.

2. The compiler of claim 1, wherein analysis and explicit 
data cache prefetch instruction insertion are performed by 
said compiler in a machine instruction level optimizer. 

3. The compiler of claim 1, further comprising: 

means for exploiting execution profiles from previous
runs of an application during insertion of prefetch
instructions into innermost loops with internal control
flow.

4. The compiler of claim 1, wherein said prefetch insertion
means is integrated with other low-level optimization
phases.

5. The compiler of claim 4, said other low-level optimization
phases comprising:

any of loop unrolling, register reassociation, and instruction
scheduling.

6. A method for mitigating or eliminating cache misses in 
a low level optimizer, comprising the steps of: 

performing loop body analysis; 

unrolling loops to reduce prefetch instruction overhead;

identifying uniformly generated equivalence classes of
memory references in a code stream, where said
equivalence classes represent disjoint sets of memory
references occurring in a loop whose address expressions
can be expressed as a linear function of the same
basic loop induction variable and are known to differ
only by a compile time constant, allowing the detection
of group spatial and group temporal locality among
said different memory references;

computing an effective memory stride for each of the 
equivalence classes; 

determining the number of prefetch instructions needed
for full cache miss coverage for each equivalence class, 
where the number of prefetch instructions that needs to 
be inserted is a function of the style of prefetching 
desired, including dumb prefetching that inserts an 
explicit prefetch instruction for each memory 
reference, baseline prefetching that inserts as many 
prefetch instructions as possible without affecting the 
resource minimum loop iteration latency, and selective 
prefetching that inserts as many prefetch instructions as 
are required to ensure full cache miss coverage, exploit- 
ing any group-spatial or group-temporal locality that 
may be apparent among memory references within a 
uniformly generated equivalence class; and 

inserting prefetch instructions identified into said code 
stream. 

7. The method of claim 6, further comprising the step of: 
estimating a prefetch iteration distance for a loop as the 

ratio of average miss latency and average loop iteration 
latency, where the average loop iteration latency is 
derived from a resource-constrained lower bound on a 
cycle count based on machine resource usage. 

8. The method of claim 6, further comprising the step of:
substituting a fixed constant value for unknown terms into 

the address expressions for memory references to run- 
time dimensioned arrays to facilitate partitioning of 
such references into disjoint equivalence classes. 

9. The method of claim 6, further comprising the step of: 
determining the number of prefetch instructions that are 

needed for each uniformly generated equivalence class 
for a selective prefetching strategy.

10. The method of claim 9, further comprising the step of: 
sorting the address expressions for memory references 

belonging to an equivalence class based on their rela- 
tive constant differences. 

11. The method of claim 9, further comprising the step of: 
determining an effective memory stride for the memory 

references associated with each equivalence class and 
classifying the effective memory stride as being either 
large or small based on whether it is greater than the 
cache line size. 

12. The method of claim 9, further comprising the step of: 
determining prefetch memory distance for the memory 

references associated with each equivalence class as the 
product of effective memory stride and prefetch itera- 
tion distance for the loop. 

13. The method of claim 9, further comprising the step of: 
removing memory references within a small-stride 

equivalence class that trail other memory references 
within said equivalence class by less than the prefetch 
memory distance, wherein memory references that 
remain are cluster leaders. 

14. The method of claim 9, further comprising the step of: 
grouping the memory references belonging to a large-stride
equivalence class that are sorted by their constant
address expression differences into clusters each of
which has a distinct memory reference designated as
the cluster leader and zero or more memory references
designated as cluster trailers.

15. The method of claim 9, further comprising the step of:
merging clusters represented by their leaders to profitably
exploit group temporal locality in a pairwise fashion.

16. The method of claim 6, further comprising the steps 

of: 

deciding which equivalence classes to insert prefetch
instructions for under the baseline prefetching strategy
by first sorting uniformly generated equivalence
classes based on a prefetch cost/expected benefit
criteria, and only committing to insert prefetch instructions
for those equivalence classes with the best cost/
expected benefit ratio, without causing resource-based
minimum loop iteration latency to be exceeded.

17. The method of claim 6, further comprising the steps
of:

running through clusters in each equivalence class;
generating explicit prefetch instructions for each cluster;
and
inserting said prefetch instructions into the code stream.



US005797013A

United States Patent [19]          [11] Patent Number: 5,797,013

Mahadevan et al.                   [45] Date of Patent: Aug. 18, 1998



[54] INTELLIGENT LOOP UNROLLING 

[75] Inventors: Uma Mahadevan, Sunnyvale; Lacky
Shah, Fremont, both of Calif.

[73] Assignee: Hewlett-Packard Company, Palo Alto,
Calif.



[21] Appl. No.: 564,514 

[22] Filed: Nov. 29, 1995 

[51] Int. Cl.6 .................. G06F 9/45

[52] U.S. Cl. ................... 395/709; 395/588

[58] Field of Search ............ 395/705, 709,
                                  395/588, 580



[56] References Cited 

U.S. PATENT DOCUMENTS 

5,265,353    11/1993    Yamada ............ 395/700

5,367,651    11/1994    Smith et al. ...... 395/700

5,386,562     1/1995    Jain et al. ....... 395/650

OTHER PUBLICATIONS 

"A Comparative Evaluation of Software Techniques to Hide 
Memory Latency", John et al., Proc. of the 28th Ann. Hawaii
Int'l Conf., 1995, pp. 229-238.

"Schedule driven Loop Unrolling for Parallel Processors". 
System Sciences, 1991 Annual Hawaii Int'l Conference,
1991, vol. II, pp. 458-467.



"Aggressive Loop Unrolling in a Retargetable, Optimizing 
Compiler", Davidson et al., Dept. of Comp. Science, Univ. of
Va., pp. 1-14.

"Unrolling Loops in Fortran." Dongarra et al.. Soft. Practice 

and Experience, vol. 9, 1979, pp. 219-226.

Hendren et al., "Designing Programming Languages for the
Analyzability of Pointer Data Structures," Comput. Lang.,
vol. 19, No. 2, pp. 119-134 (1993).

Weiss et al., "A Study of Scalar Compilation Techniques for
Pipelined Supercomputers," ACM, pp. 105-109 (1987).

Primary Examiner— Emanuel Todd Voeltz 
Assistant Examiner — Kakali Chaki 



[57] ABSTRACT

A compiler facilitates efficient unrolling of loops and
enables the elimination of extra branches from the loops,
including the elimination of conditional branches from
unrolled loops with early exits. Unrolling also enhances
other optimizations, such as prefetch, scalar replacement,
and instruction scheduling. The unroll factor is calculated to
determine the amount of loop expansion and the optimum
location to place compensation code to complete the original
loop count, i.e. before or after the unrolled loop. The
compiler is applicable, for example, to modern RISC
architectures, where the latency of memory references and
branches is higher than that of integer and floating point
arithmetic instructions.

16 Claims, 13 Drawing Sheets 



[Front-page figure: flow from Loops through an Analysis Module,
Optimizer Module, Unroll Module, and Compensation Module to
Unrolled Loops; reference numerals 30, 1301, 1302, 1303, 1304]



U.S. Patent    Aug. 18, 1998    Sheet 1 of 13    5,797,013

[Fig. 1 (PRIOR ART): block diagram with PROCESSOR and MEMORY
blocks; reference numerals 11, 12, 13, 14, 15]



[Fig. 2b (PRIOR ART); reference numeral 120]



I = N
BEGIN_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_LOOP;
END_LOOP:
    PRINT I

Fig. 3 (PRIOR ART); reference numeral 150



I = N
BEGIN_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_LOOP
    C[I] = A[I] + X
    GOTO BEGIN_LOOP
END_LOOP:
    PRINT I

Fig. 7a (PRIOR ART); reference numerals 401, 403, 405



BEGIN_LOOP:
    A[I] = A[I-2] / B[I]
    I = I + 1;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 9 (PRIOR ART); reference numeral 367




I = N
IF (I < 4) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 4) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
    IF (I = 0) GOTO END_COMPENSATION_LOOP
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP;
END_COMPENSATION_LOOP:
    PRINT I

Fig. 4 (PRIOR ART); reference numerals 155, 159, 301, 303



U.S. Patent Aug. 18, 1998 Sheet 6 of 13 5,797,013

I = N
IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP      311
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 5) GOTO UNROLLED_LOOP           321
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP;
END_COMPENSATION_LOOP:
PRINT I

Fig. 5



U.S. Patent Aug. 18, 1998 Sheet 7 of 13 5,797,013

J = ((N-1) % 4) + 1     NOTE: ((N-1) MOD 4) + 1
I = N - J;              NOTE THAT I IS GUARANTEED
                        TO BE DIVISIBLE BY 4 HERE.
BEGIN_COMPENSATION_LOOP:                     383
    A[J] = B[J] + X
    IF (J > 0) GOTO BEGIN_COMPENSATION_LOOP;
END_COMPENSATION_LOOP:
IF (I = 0) GOTO END_UNROLLED_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
PRINT I

Fig. 6



U.S. Patent Aug. 18, 1998 Sheet 8 of 13 5,797,013

I = N
BEGIN_LOOP:
UNROLLED_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;   365
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_LOOP:
PRINT I

Fig. 7b (PRIOR ART); reference numeral 311 also appears on the sheet.



U.S. Patent Aug. 18, 1998 Sheet 9 of 13 5,797,013

I = N
IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP      311
UNROLLED_LOOP:
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    IF (I >= 5) GOTO UNROLLED_LOOP;          319
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;   365
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_LOOP:
PRINT I

Fig. 8; reference numerals 312, 361 and 363 also appear on the sheet.



U.S. Patent Aug. 18, 1998 Sheet 10 of 13 5,797,013

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]                          391
    T = T1                                   393
    T1 = A[I]                                395
    I = I + 1;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 10 (PRIOR ART); reference numeral 388 also appears on the sheet.

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]                          391
    T = T1                                   393
    T1 = A[I]                                395
    I = I + 1;
    A[I] = T / B[I]                          391
    T = T1                                   393
    T1 = A[I]                                395
    I = I + 1;
    A[I] = T / B[I]                          391
    T = T1                                   393
    T1 = A[I]                                395
    I = I + 1;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 11 (PRIOR ART); reference numeral 388 also appears on the sheet.



U.S. Patent Aug. 18, 1998 Sheet 11 of 13 5,797,013

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T2 = A[I]
    A[I+1] = T1 / B[I+1]
    T = A[I+1]
    A[I+2] = T2 / B[I+2]
    T1 = A[I+2]
    I = I + 3;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 12; reference numerals on the sheet: 400, 401, 402, 403, 410.



U.S. Patent Aug. 18, 1998 Sheet 12 of 13 5,797,013

[Fig. 13: flowchart of the unrolling logic. Box texts, in
reading order:
    DETERMINE TRIPCOUNT IF KNOWN; DETERMINE UNROLL FACTOR FROM
        LOOPLENGTH, PREFETCH SPAN, PROFILE, RESOURCE USAGE
    IF TRIPCOUNT KNOWN AND TRIPCOUNT MOD UNROLLFACTOR = 0
        (YES / NO)
    ULCODE = UNROLL(LOOP, UNROLLFACTOR, NOCOMP);
        REMOVEMIDDLEEXITS(ULCODE) EXCEPT THE FINAL EXIT
    ULCODE = UNROLL(LOOP, UNROLLFACTOR, COMPOK);
        REMOVEMIDDLEEXITS(ULCODE)
    OUTPUT(ULCODE)
    ULCODE = MOVE FINAL EXIT TO END OF LOOP (ULCODE)
    IF PLACECOMP(UNROLL FACTOR) = BEFORE
    OUTPUT ULCODE; OUTPUT COMPCODE FOR LOOP
    OUTPUT COMPCODE FOR LOOP; OUTPUT ULCODE
    DONE
Reference numerals on the sheet: 201, 203, 205, 222, 226, 229,
242, 244, 246, 999. The routing of the edges between boxes is
not recoverable from the extracted text.]



U.S. Patent Aug. 18, 1998 Sheet 13 of 13 5,797,013

[Fig. 14: block diagram. Loops -> Analysis Module -> Optimizer
Module -> Unroll Module -> Compensation Module -> Unrolled
Loops. Reference numerals on the sheet: 24, 30, 1301, 1302,
1303, 1304.]




INTELLIGENT LOOP UNROLLING 

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention relates to compilers. More particularly, the 
invention relates to techniques for unrolling computation 
loops in a compiler so as to generate code which executes 
faster. 

2. Description of The Prior Art 

FIG. 1 is a block schematic diagram of a uniprocessor 
computer architecture, including a processor cache. In the 
figure, a processor 11 includes a cache 12 which is in 
communication with a system bus 15. A system memory 13 
and one or more I/O devices 14 are also in communication
with the system bus. 

FIG. 2a is a block schematic diagram of a software 
compiler 20, for example as may be used in connection with
the computer architecture shown in FIG. 1. The compiler 
Front End component 21 reads a source code file (100) and 
translates it into a high level intermediate representation 
(110). A high level optimizer 22 optimizes the high level 
intermediate representation 110 into a more efficient form. A 
code generator 23 translates the optimized high level inter- 
mediate representation to a low level intermediate represen- 
tation (120). The low-level optimizer 24 converts the low 
level intermediate representation (120) into a more efficient 
(machine-executable) form. Finally, an object file generator
25 writes out the optimized low-level intermediate repre- 
sentation into an object file (141). The object file (141) is 
processed along with other object files (140) by a linker 26 
to produce an executable file (150), which can be run on the 
computer 10. In the invention described herein, it is assumed 
that the executable file (150) can be instrumented by the 
compiler (20) and linker (26) so that when it is run on the 
computer 10, an execution profile (160) may be generated,
which can then be used by the low level optimizer 24 to 
better optimize the low-level intermediate representation 
(120). The compiler 20 is discussed in greater detail below. 

The compiler is the piece of software that translates
source code, such as C, BASIC, or FORTRAN, into a binary
image that actually runs on a machine. Typically the com- 
piler consists of multiple distinct phases, as discussed above 
in connection with FIG. 2a. One phase is referred to as the 
front end, and is responsible for checking the syntactic 
correctness of the source code. If the compiler is a C 
compiler, it is necessary to make sure that the code is legal 
C code. There is also a code generation phase, and the 
interface between the front-end and the code generator is a 
high level intermediate representation. The high level inter- 
mediate representation is a more refined series of instruc- 
tions that need to be carried out. For instance, a loop might 
be coded at the source level as: 

which might in fact be broken down into a series of steps, 
e.g. each time through the loop, first load up I and check it 
against 10 to decide whether to execute the next iteration. 

A code generator takes this high level intermediate rep- 
resentation and transforms it into a low level intermediate 
representation. This is closer to the actual instructions that 
the computer understands. An optimizer component of a 
compiler must preserve the program semantics (i.e. the 
meaning of the instructions that are translated from source 
code to a high level intermediate representation, and thence
to a low level intermediate representation and ultimately an
executable file), but rewrites or transforms the code in a way
that allows the computer to execute an equivalent set of 
instructions in less time. 

Modern compilers are structured with a high level optimizer
(HLO) that typically operates on a high level intermediate
representation and substitutes in its place a more efficient
high level intermediate representation of a particular program
that is typically shorter. For example, an HLO might eliminate
redundant computations. With the low level optimizer (LLO), the
objectives are the same as for the HLO, except that the LLO
operates on a representation of the program that is much closer
to what the machine actually understands.

FIG. 2b is a block diagram showing a low level optimizer
for a compiler, including a loop unrolling component 30
according to the invention. The low level optimizer 24 may
include any combination of known optimization techniques,
such as those that provide for local optimization 35, global
optimization 36, loop identification 37, loop invariant code
motion 38, prefetch 34, register reassociation 31, and
instruction scheduling 32.

Source programs translated into machine code by compilers
contain loops, e.g. DO loops, FOR loops, and WHILE loops.
Optimizing the compilation of such loops can have a major
effect on the run time performance of the program generated by
the compiler. In some cases, a significant amount of time is
spent doing such bookkeeping functions as loop iteration and
branching, as opposed to the computations that are performed
within the loop itself. These loops often implement scientific
applications that manipulate large arrays and data
instructions, and run on high speed processors.

This is particularly true on modern processors, such as
RISC architecture machines. The design of these processors
is such that in general the arithmetic operations operate a lot
faster than memory fetch operations. This mismatch
between processor and memory speed is a very significant
factor in limiting the performance of microprocessors. Also,
branch instructions, both conditional and unconditional,
have an increasing effect on the performance of programs.
This is because most modern architectures are super-pipelined
and have some sort of a branch prediction algorithm
implemented. The aggressive pipelining makes the
branch misprediction penalty very high. Arithmetic
instructions are interregister instructions that can execute
quickly, while the branch instructions, because of
mispredictions, and memory instructions such as loads and
stores, because of slower memory speeds, can take a longer
time to execute.

Modern compilers perform code optimization. Code optimization
consists of several operations that improve the speed and size
of the compiled code, while maintaining semantic equivalence.
Common optimizations include:

   prefetching data so that they are available in cache
      memory when needed;

   detecting calculations that compute constants and
      performing the calculation at compile time;

   scalar replacement, which keeps the value of a variable in
      a register within the loop;

   moving calculations outside of loops where possible; and

   performing code scheduling, which consists of rearranging
      the order of and modifying instructions to achieve
      faster running but semantically equivalent code.

Many modern compilers also employ an optimizing technique
known as loop unrolling to generate faster running code. In
its essence, loop unrolling takes the inner loop, i.e. the
code between the beginning and the end of the loop, and
repeats it in the inner loop some number of times, e.g. four
times. It then executes the unrolled loop one-fourth as many
times as it would have executed the original loop. The
number of times the loop is replicated within the unrolled
loop is called the unroll factor. Because the number of times
the original loop is executed is not always divisible by the
unroll factor, compensation loop code often has to be
generated to execute the remaining iterations of the
original loop that are not executed by the unrolled loop.
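
The transformation described above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, arrays are indexed 1..N with index 0 unused, N is at least 1, and the inner `for` stands in for the textual replication a compiler would emit. It follows the general shape of the prior-art loops of FIG. 3 and FIG. 4, with the zero test folded into the compensation loop header.

```python
def original_loop(b, x):
    """FIG. 3 shape: A[I] = B[I] + X for I = N down to 1 (index 0 unused)."""
    n = len(b) - 1                  # assumes n >= 1, as the bottom-tested loop does
    a = [0] * (n + 1)
    i = n
    while True:
        a[i] = b[i] + x
        i -= 1
        if i <= 0:                  # loop-closing test, once per iteration
            break
    return a


def unrolled_loop(b, x, factor=4):
    """Same computation, unrolled by `factor` with a trailing compensation loop."""
    n = len(b) - 1
    a = [0] * (n + 1)
    i = n
    if i >= factor:
        while True:
            # Stands in for `factor` textual copies of the loop body.
            for _ in range(factor):
                a[i] = b[i] + x
                i -= 1
            if i < factor:          # loop-closing test, once per `factor` iterations
                break
    # Compensation loop: the 0..factor-1 iterations that remain.
    while i > 0:
        a[i] = b[i] + x
        i -= 1
    return a
```

For a trip count of 10 and an unroll factor of 4, the unrolled loop body runs twice (8 iterations) and the compensation loop handles the remaining 2.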

As discussed above, such loops as DO, FOR and WHILE
loops are common in programs, especially in scientific and 
other time-consuming programs. Frequently 80% of the 
running time of a program can be in a few small loops. As 
a result anything that can speed up such loops is of great 
value in making a more efficient compiler. 

Consider the simple loop shown on FIG. 3. The three 
instructions in the inner loop 150 are executed N times. 
According to the prior art this loop can be unrolled with an 
unroll factor of four to produce the code shown on FIG. 4,
where the inner loop (less the exit condition) is replicated 4
times 333. This loop is also followed by a test 303 to see if
the full original loop has been completed, and a compensation
loop 155 which is executed to complete the original
loop trip count if it has not been completed.

An inspection of the loop shows that it is semantically 
equivalent to the loop of FIG. 3 because the same function 
is performed and the same result is achieved. However, the 
loop of FIG. 4 runs much faster for several reasons. First, the 
conditional branch 159 which exits the loop is executed only 
once for each time through the unrolled loop rather than 
once for each time through the original loop. Assuming an 
unroll factor of four as is shown here, this saves 3/4 of a
conditional branch per original loop iteration. More
importantly, other compiler optimizations interact with loop
unrolling and are able to do a much better job of optimizing
the unrolled loop, such as that identified by numeric
designator 333, as compared to the original loop, identified by
numeric designator 150. In an unrolled loop, there are more
operations that could be scheduled in parallel, more oppor- 
tunity to do scalar replacement and other optimizations, and 
more possibilities to do prefetching. 

The tests identified by numeric designators 301 and 303 
are also of interest. These are the conditional branches which 
have a higher probability of being mispredicted. Anything 
that can be done to eliminate one of them will be very useful. 

Prior art loop unrolling techniques have certain
disadvantages. For example, J. J. Dongarra and A. R. Hinds,
Unrolling Loops in FORTRAN, describes how one can unroll loops
manually by duplicating code. This is an early solution to
optimizing code that is not even implemented by the
compiler.

S. Weiss and J. E. Smith, A Study of scalar compilation
techniques for pipeline supercomputers, discuss unrolling in
the compiler. Here the authors address the simple situations:

a) Cases where the loop count is known at compile time. 
They do not address loop unrolling when the loop count 
is only known at run time. 

b) Cases where the loop exit appears only at the beginning 
or end of the loop. They do not address the situation of 
unrolling loops with early exits (loops whose exit may 
occur in the middle of the loop). 

The authors do not address the following issues: 

a) Determining whether to place the compensation code 
before or after the unrolled loop. 

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor. 



L. J. Hendren and G. R. Gao, Designing Programming
Languages for the Analyzability of Pointer Data Structures,
address the issue of unrolling loops as part of compiler
optimization. They do not however discuss:

a) Unrolling loops with early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor;

d) Whether to place compensation code at the beginning
or end of the loops; and

e) Compiling loops whose trip count is only known at run
time.

J. Davidson, S. Jinturkar, Aggressive Loop Unrolling in a
Retargetable, Optimizing Compiler, Dept. of Computer
Science, Thornton Hall, University of Virginia, disclose a
code transformation, referred to as aggressive loop
unrolling, in a retargetable optimizing compiler where the
loop bounds are not known at compile time. Various factors
were analyzed to determine how and when loop unrolling
should be applied, resulting in an algorithm for loop
unrolling in which execution-time counting loops (i.e. a
counting loop whose iteration count is not trivially known at
compile time) are unrolled and loops having complex
control-flow are unrolled. However, they do not discuss:

a) Unrolling loops having early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor; and

d) Whether to place compensation code at the beginning
or end of the loops.

Another part of the loop unrolling prior art is shown on
FIG. 7a, which illustrates a loop having an early exit (also
referred to as a WHILE loop), consisting of an exit test and
branch 403 in the middle of the loop between computations
401 and 405. According to the prior art, the code, including
the exit 403, is replicated four times in an unrolled loop. The
number of branches in the loop is however not reduced
with this optimization.

While some of the techniques discussed in the prior art are
applicable to compilers for all computers, only some of them
are particularly applicable for modern RISC computers,
where branch instructions form a much bigger bottleneck than
in earlier technologies. Also, compilers for RISC
architectures are a lot more aggressive, and the interactions
of various optimizations play a key role in the quality of the
final code.

SUMMARY OF THE INVENTION 

The invention provides a new compiler that can unroll
more loops than previous algorithms. It also significantly
reduces the number of branch instructions by cleverly
handling the iteration count and by converting loops with early
exits to regular FOR loops. The invention also provides for
computing the unroll factor and the placement of the
compensation loop by taking many other optimizations into
consideration.

The compiler:

   Eliminates time consuming conditional branch instructions
      from the compensation code loop by changing the
      conditional exit of the main unrolled loop so that it
      always exits with at least one iteration which has yet
      to be executed by the compensation code. This eliminates
      the need to test for zero remaining iterations.

   Determines whether it is better to place the compensation
      code at the beginning or the end of the unrolled loop
      according to which one would likely provide the better
      optimization. Generally, it prefers to put the
      compensation loop in front of the main loop if the unroll
      factor is a power of two and after the main loop if the
      unroll factor is not a power of two.

   Computes the unroll factor by taking into account the
      interactions of other optimizations like prefetch, scalar
      replacement and register allocation, and also taking
      into account hardware features like the number of
      functional units. Unrolling loops over-aggressively or
      under-aggressively can inhibit other optimizations or
      make them less effective.

   Converts loops with early exit to loops with exit at the end
      to apply more efficient optimizations to the loop. It does
      this by ensuring that the compensation code is always
      executed at least once, enabling the compiler to
      eliminate the exit tests from the unrolled loop.
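
As a rough illustration of the placement preference stated above, combined with the trip-count test that appears in the flowchart of FIG. 13, the decision can be sketched as follows. The function names and the returned tuple are hypothetical and are not taken from the patent; how the unroll factor itself is chosen is outside this sketch.

```python
def is_power_of_two(k):
    """True for 1, 2, 4, 8, ... (k must be a positive integer)."""
    return k > 0 and (k & (k - 1)) == 0


def plan_compensation(unroll_factor, trip_count=None):
    """Return (needs_compensation, placement).

    trip_count is None when it is not known at compile time.
    placement is "before", "after", or None when no compensation
    code is needed.
    """
    # Known trip count that divides evenly: no compensation code at all.
    if trip_count is not None and trip_count % unroll_factor == 0:
        return False, None
    # Otherwise prefer pre-loop compensation for power-of-two factors,
    # where the remainder is cheap to compute, and post-loop otherwise.
    placement = "before" if is_power_of_two(unroll_factor) else "after"
    return True, placement
```

For example, `plan_compensation(4)` selects compensation code in front of the unrolled loop, while `plan_compensation(3)` selects compensation code after it.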

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a block schematic diagram of a uniprocessor 
computer architecture including a processor cache; 

FIG. 2a shows a block schematic diagram of a modern
software compiler;

FIG. 2b shows a schematic diagram of the low level
optimizer;

FIG. 3 shows a simple program loop; 

FIG. 4 shows the loop of FIG. 3 that has been unrolled 
four times according to the prior art; 

FIG. 5 shows the loop of FIG. 3 unrolled according to the 
invention and having an eliminated branch; 

FIG. 6 shows an unrolled loop having pre-loop compen- 
sation code; 

FIG. 7a shows a simple WHILE loop;

FIG. 7b shows the WHILE loop of FIG. 7a that has been
unrolled according to the prior art;

FIG. 8 shows the WHILE loop of FIG. 7a that has been
unrolled according to the invention;

FIG. 9 shows a schematic representation of a simple loop; 

FIG. 10 shows the loop of FIG. 9 with scalar replacement 
according to the prior art; 

FIG. 11 shows the loop of FIG. 10 unrolled; 

FIG. 12 shows a schematic representation of the loop of 
FIG. 11 after copy elimination; 

FIG. 13 shows a schematic representation of the compiler 
logic which determines the unroll factor, compensation code 
placement, and other optimizations; and 

FIG. 14 is a block schematic diagram of a compiler for a 
programmable machine in accordance with the invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

The invention provides a new compiler that features smart 
unrolling of loops. 

The invention provides a prefetch driver 34 that operates 
in concert with such known techniques. The following 
discussion pertains to the various elements of the low level 
optimizer shown on FIG. 2b: 

Local optimizations include code improving transforma- 
tions that are applied on a basic block by basic block basis. 
For purposes of the discussion herein, a basic block corre- 
sponds to the longest contiguous sequence of machine 
instructions without any incoming or outgoing control 
transfers, excluding function calls. Examples of local
optimizations include local common sub-expression elimination
(CSE), local redundant load elimination, and peephole
optimization.

Global optimizations include code improving transformations
that are applied based on analysis that spans across
basic block boundaries. Examples include global common
sub-expression elimination, loop invariant code motion, 
dead code elimination, register allocation and instruction 
scheduling. 

Loop invariant code motion is the identification of
instructions located within a loop that compute the same result
on every loop iteration and the re-positioning of such
instructions outside the loop body.
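
A minimal before-and-after illustration of loop invariant code motion, written by hand in Python; the function and variable names are hypothetical, and a compiler would apply the same idea to its intermediate representation rather than to source text.

```python
def scale_naive(b, x, y):
    """The expression x * y + 1 is recomputed on every iteration."""
    a = []
    for v in b:
        factor = x * y + 1      # loop invariant: same result each time
        a.append(v * factor)
    return a


def scale_hoisted(b, x, y):
    """The invariant computation has been moved in front of the loop."""
    factor = x * y + 1          # computed once, outside the loop body
    a = []
    for v in b:
        a.append(v * factor)
    return a
```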

Register allocation and instruction scheduling is the
process of assigning hardware registers to symbolic instruction
operands and the re-ordering of instructions to minimize
run-time pipeline stalls where the processor must wait on a
memory fetch from main memory or wait for the completion
of certain complicated instructions that take multiple cycles
to execute (e.g. divide, square root instructions).

One important phase of the compiler identifies loops and 
access patterns to estimate how many cycles are devoted to 
loop iterations. In the invention, the compiler translates the 
higher level application code into an instruction stream that 
the processor executes, and in the process of this translation 
the compiler unrolls loops. 

The longer unrolled loops allow the compiler to provide
several advantages, such as:

1) It eliminates the extra branch exits. This saves CPU
   cycles by not having to execute the branch instructions
   and also helps reduce branch misprediction. Why this is
   important is evident if one considers most modern
   RISC architectures. These architectures have a long
   pipeline that is fed by an instruction fetch mechanism.
   When the fetch mechanism encounters a branch, it tries
   to predict if the branch is going to be taken or not. It
   then fetches instructions based on this prediction. The
   prediction is necessary to keep the pipeline from
   stalling. If the architecture's prediction is correct (this
   is determined when the branch instruction completes
   execution, which is a few cycles after it has been
   fetched), then everything works fine; else all the
   instructions that have been fetched after the branch are
   discarded and new instructions are fetched based on the
   correct outcome of the branch. This penalty of discarding
   fetched instructions and fetching new ones, when
   the branch is mispredicted, is known as the branch
   misprediction penalty and it is very significant for most
   modern architectures. It is of the order of 5-10 cycles
   per branch instruction that is mispredicted. By reducing
   the branch instructions, the number of branches that get
   mispredicted automatically reduces.

2) It can better insert prefetches and effect other
   optimizations into the longer inner loop code. When the loop
   is unrolled, there are more memory instructions in the
   loop and also the memory stride (the distance between
   the memory accesses of an instruction in two consecutive
   iterations) is bigger. If a loop is unrolled four times,
   the memory stride goes up by a factor of four. This helps
   the prefetch to do a more effective job. When the memory
   stride increases, as long as it is less than the cache line
   size (which is architecture dependent), the prefetches
   become more effective. When the memory stride
   becomes greater than the cache line size, the prefetches
   can hurt. Hence the loops should be unrolled such that
   the memory stride is less than the cache line size
   whenever prefetch instructions are going to be generated
   for the loop and whenever possible.

3) Scalar replacement/recurrence elimination inserts copies
   in the loops to keep the value of a variable live in
   the next iteration (see, for example, FIGS. 9, 10, and
   11). These copies can be eliminated by unrolling the
   loop a certain number of times.

4) Longer sequences of non-branching instructions can
   achieve an overlap between instructions that have nothing
   to do with memory and those that do. This is known
   as instruction scheduling and is explained below. While
   the access time between the processor and the cache is
   typically 1 to 5 cycles, the retrieval time from cache to
   memory is often on the order of 10 to 100 cycles. When
   the processor actually gets to the point where the data
   item is needed from memory, if the data is not in cache,
   it might take 100 processor cycles to fetch it from main
   memory. Where the compiler can optimize the longer
   inner loop code, it may only be necessary to wait for 20
   cycles because 80 cycles worth of look up time is
   hidden or overlapped with the execution of other
   instructions.
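
The stride constraint in item 2 can be turned into a small selection rule. The sketch below is an assumption-laden illustration, not the patent's heuristic: it assumes a unit-stride loop that touches one array element per original iteration, a hypothetical candidate list of unroll factors, and illustrative element and cache line sizes.

```python
def max_unroll_for_prefetch(element_bytes, cache_line_bytes,
                            candidates=(8, 4, 2, 1)):
    """Largest candidate factor whose unrolled stride stays below one cache line.

    After unrolling by `factor`, consecutive iterations of the unrolled
    loop are element_bytes * factor apart in memory.
    """
    for factor in candidates:           # candidates are tried largest first
        if element_bytes * factor < cache_line_bytes:
            return factor
    return 1                            # fall back to no unrolling
```

With 8-byte elements, a 64-byte line admits a factor of 4 (a stride of 32 bytes), whereas a 32-byte line only admits a factor of 2.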

Loop unrolling is integrated with other low level optimi- 
zation phases, such as the prefetch insertion algorithm, 
register reassociation, and instruction scheduling. The new
compiler yields significant performance improvements for
some industry-standard performance benchmarks, for
example on the SPEC92 and SPEC95 benchmarks on the
Hewlett-Packard Company (Palo Alto, Calif.) PA-8000
processor.

The following discussion explains compiler operation in 
the context of a loop within an application program. Loops 
are readily recognized as a sequence of code that is itera- 
tively executed some number of times. The sequence of such 
operations is predictable because the same set of operations 
is repeated for each iteration of the loop. It is common 
practice in an application program to maintain an index 
variable for each loop that is provided with an initial value, 
and that is incremented by a constant amount for each loop 
iteration until the index variable reaches a final value. The 
index variable is often used to address elements of arrays 
that correspond to a regular sequence of memory locations. 

In the compiler, it has been found that the low level 
optimizer component of a compiler is in a good position to 
deduce the number of cycles required by a stretch of code 
that is repetitively executed and this information can be used 
to determine the optimal unroll factor. As discussed above, 
the concept of loop unrolling is not new, but use of smart 
unrolling is new. For example, FIG. 4 shows the loop of FIG.
3 after the loop has been unrolled four times. Thus, instead
of executing the loop 100 times if N were 100, the unrolled
loop is executed 25 times.

FIG. 5 shows the output code that is generated by the
invention in contrast to the code generated by the prior art,
as shown on FIG. 4. The replicated inner loops 333 are the
same. Also, the compensation loop 323 is the same as the
prior art compensation loop 155 of FIG. 4. However, the
loop tests at 301 and 159 of FIG. 4 now compare I with 5
rather than 4, as can be seen at 311 and 321 of FIG. 5.
The effect is to ensure that the compensation loop is always
executed at least once. This eliminates the need to test for the
zero case, i.e. the branch instruction 303 on FIG. 4. As
indicated above, the elimination of this branch instruction
significantly increases the speed of the compiled code by
reducing the number of
branch instructions that get mispredicted.
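
The effect of raising the threshold from the unroll factor to the unroll factor plus one can be checked with a small simulation. This is a sketch with hypothetical names, assuming N is at least 1; it records which index values the loop visits instead of performing the array assignment, and the inner `for` stands in for the replicated bodies.

```python
def fig5_style(n, factor=4):
    """Return (indices visited, compensation iterations) for n >= 1."""
    visited = []
    i = n
    threshold = factor + 1              # 5 rather than 4 for a factor of 4
    if i >= threshold:                  # entry test (311 on FIG. 5)
        while True:
            for _ in range(factor):     # stands in for the replicated bodies
                visited.append(i)
                i -= 1
            if i < threshold:           # loop-closing test (321 on FIG. 5)
                break
    comp_iterations = 0
    while True:                         # bottom-tested; no zero test in front
        visited.append(i)
        i -= 1
        comp_iterations += 1
        if i <= 0:
            break
    return visited, comp_iterations
```

Under these assumptions the compensation loop always runs between 1 and 4 times for an unroll factor of 4, which is why the zero test in front of it is unnecessary. The trade-off is that it may run a full 4 iterations (for example when N is 8) where the prior-art form would have run none.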

It is also possible to put the compensation code in front of
the main loop, as is shown on FIG. 6. Here the compensation
loop 383 is in front of the repetitive unrolled loops 347. In
the general case, putting the compensation code before the
main unrolled loop is less efficient than putting it afterwards,
because calculating the loop trip count requires a remainder
operation which involves high latency divide operations.
However, if the unroll factor is a power of two, as in this
case where the unroll factor is 4, the remainder calculation
is a simple shift operation. Because unroll factors of 2, 4, or
8 are common, the compensation code can be placed in front
of the unrolled loop for negligible cost. As a practical
matter, it is often advantageous to put the compensation loop
in front of the unrolled loop to benefit from other
optimizations such as register reassociation. When the
compensation loop is placed before the unrolled loop, the
variable that keeps track of the iteration count is not always
needed after the unrolled loop. When the compensation loop is
placed after the unrolled loop, this variable is always needed
after the unrolled loop as there is an exposed use in the
compensation loop. This exposed use can inhibit aggressive
register reassociation. In the preferred embodiment, the
architecture of the computer and the interactions with other
optimizations dictate an unroll factor, and if it is a power of
two, the compensation code is inserted in front of the
unrolled loop.
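
The trip-count split used for pre-loop compensation can be sketched as follows. The names are hypothetical, N is assumed to be at least 1, and the cheap remainder for power-of-two factors is realized here as a bit mask, which is one common way to avoid a divide and an assumption on my part. The second function is not a transcription of FIG. 6; it only shows the compensation iterations running first and the unrolled loop then covering a multiple of the unroll factor.

```python
def split_trip_count(n, factor):
    """Return (compensation_iters, unrolled_iters) with 1 <= compensation <= factor."""
    if factor & (factor - 1) == 0:
        comp = ((n - 1) & (factor - 1)) + 1     # remainder by bit mask, no divide
    else:
        comp = ((n - 1) % factor) + 1           # general case needs a remainder
    return comp, n - comp


def add_scalar_pre_compensated(b, x, factor=4):
    """A[I] = B[I] + X for I = N..1, with the compensation loop in front."""
    n = len(b) - 1
    a = [0] * (n + 1)
    comp, _ = split_trip_count(n, factor)
    i = n
    for _ in range(comp):                       # compensation loop runs first
        a[i] = b[i] + x
        i -= 1
    while i > 0:                                # i is a multiple of factor here
        for _ in range(factor):                 # stands in for replicated bodies
            a[i] = b[i] + x
            i -= 1
    return a
```

Because the loop counter is fully consumed by the unrolled loop, nothing after it depends on a leftover count, which is the property the text above relates to register reassociation.
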

Another optimization technique that is part of the invention
herein disclosed rearranges loops with early exits
(which are henceforth referred to as WHILE loops). These
loops are characterized by the fact that some of the inner
loop code is done before the loop test, and some after the
loop test, as is shown on FIG. 7a. Here the loop has an exit
branch 403 in the middle with inner loop operations 401
before it, and other inner loop operations 405 after it.

The optimization taught in the prior art for this loop is
shown on FIG. 7b. Notice that the whole inner loop,
including the exit instruction, is replicated 377 four times.
This can be improved by converting the unrolled loop into
a FOR loop with an exit condition of (unroll factor + 1) as
opposed to unroll factor (e.g. 5 instead of 4 in this case), as
is shown on FIG. 8. This guarantees that the unrolled loop
is exited before it would have to exit due to the WHILE
condition. Because none of the exit branches at 377 on FIG. 7b
would then be taken, the WHILE exit instruction can be removed,
as is shown at 319 on FIG. 8. Thus, there is only one place that
there is a WHILE loop exit, i.e. at 365.
The technique herein disclosed ensures that the unrolled
loop exits before it would take any of the WHILE loop exits,
so that the WHILE test can be removed from the unrolled
loop. It is necessary to ensure that the compensation code is
always executed at least once.
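
The transformation from FIG. 7a to FIG. 8 can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the arrays are Python lists indexed 0..N, and N is assumed to be at least 0. The unrolled loop is guarded by I >= 5 (unroll factor + 1), so after four decrements I is still at least 1 and the early exit test (I < 0) could never have fired inside it.

```python
def while_loop_original(a, b, k, x, n):
    """Early-exit loop in the style of FIG. 7a."""
    a = list(a)
    c = [0] * (n + 1)
    i = n
    while True:
        a[i] = b[i] + k
        i -= 1
        if i < 0:          # the exit in the middle of the loop body
            break
        c[i] = a[i] + x
    return a, c, i


def while_loop_unrolled(a, b, k, x, n):
    """Unrolled form in the style of FIG. 8: no exit test in the unrolled body."""
    a = list(a)
    c = [0] * (n + 1)
    i = n
    while i >= 5:          # guard of unroll factor + 1
        a[i] = b[i] + k
        i -= 1
        c[i] = a[i] + x
        a[i] = b[i] + k
        i -= 1
        c[i] = a[i] + x
        a[i] = b[i] + k
        i -= 1
        c[i] = a[i] + x
        a[i] = b[i] + k
        i -= 1
        c[i] = a[i] + x
    # The compensation loop keeps the single early exit and always runs
    # at least once.
    while True:
        a[i] = b[i] + k
        i -= 1
        if i < 0:
            break
        c[i] = a[i] + x
    return a, c, i
```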
The following discusses how the unroll factor interacts
with the scalar replacement optimization. This is particularly
important because the form of this optimization determines
the unroll factor. Consider the loop shown on FIG. 9. Notice
that the value of A[I] stored in the inner loop at 367 is loaded
again two loop iterations later, when the same statement
loads A[I-2] with a value of I which is incremented by 2. The
idea behind scalar optimization, which is well known in the
prior art, is to save array values in temporary variables if
they are accessed shortly within the next few iterations.
Thus, the loop can be modified as is shown on FIG. 10.
Here, the array reference A[I-2] at 391 is replaced with T,
and it is followed by two instructions at 393 and 395 which
assign values to T and T1. Two scalar temporary variables
are necessary because the value of the two most recent array
values must be saved. The value of A[I-1] would have been
stored in the previous iteration in T1 and that is going to be
used in the next iteration. We move T1 to T and T will be
accessed in the next iteration. Similarly T1, to which A[I] is






assigned will be moved to T in the next iteration and used
2 iterations from now.

The instruction at 395 appears to make an indexed reference
to the array A and that suggests that an array access
must be made to get the number to put into T1, which would
be a high latency operation and would lose all that was
gained by the optimization. What actually happens is the
optimizer recognizes that the value of A[I] is stored two
instructions earlier at 391, and that A[I] is resident in a
register which can be stored into T1 without accessing A[I].
As T1 is likely to be assigned to a register, this operation is
a register to register instruction.

The foregoing illustrates how the various optimization
techniques are interrelated, allowing the loop unrolling optimizer
to generate code which is clearly not optimal in itself
but is optimized by other optimizers in the compiler. Initialization
code is inserted at 388 to define initial values of
T and T1.
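
The scalar replacement of FIGS. 9 and 10 can be sketched in Python as below. The function names are hypothetical and a Python `for` loop stands in for the GOTO-based loop of the figures; the local variable `value` plays the role of the register that holds A[I] right after it is computed, so T1 is filled without re-reading the array.

```python
def recurrence_plain(a, b, n):
    """Loop in the style of FIG. 9: every iteration reloads A[I-2] from the array."""
    a = list(a)
    for i in range(2, n):
        a[i] = a[i - 2] / b[i]
    return a


def recurrence_scalar_replaced(a, b, n):
    """Loop in the style of FIG. 10: T and T1 carry the two most recent values."""
    a = list(a)
    t, t1 = a[0], a[1]       # initialization (388 in FIG. 10)
    for i in range(2, n):
        value = t / b[i]     # A[I-2] replaced with T (391)
        a[i] = value
        t = t1               # 393
        t1 = value           # 395: taken from the register, not from A[I]
    return a
```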

FIG. 11 shows how such a loop can be unrolled according
to the prior art if an unroll factor of three had been chosen.
The prior art does not specify the selection of an unroll
factor of three here, but a factor of three, or a multiple of
three, is optimal because it allows other parts of the optimizer
to generate the code shown on FIG. 12. The selection of an
unroll factor here of three or a multiple of three is important
because three values of the array must be kept, A[I], A[I-1]
and A[I-2], if the array references are to be avoided.

Three variables T, T1 and T2 are used, making it possible
for other well known code optimization techniques to generate
code, such as that shown on FIG. 12. This eliminates
shuttling the temporary data from T to T1. This is an
example of where the nature of the code in the loop and its
effect on scalar optimization forces a particular unroll factor.
In general, one lists the various indexes in the loops and sorts
them to notice the maximum distance between them.

In the above case the distance between the index I and
I-2 is 2. Adding one to this value computes a primary unroll
factor, which is three in this case. This is an acceptable
unroll factor. However, if it turned out to be very small, one
might want to multiply it by a constant to get a larger unroll
factor. Alternatively, if the primary unroll factor was very
large, one might want to divide it by the loop increment if
it was a number other than one. The reasons for selecting a
particular unroll factor are discussed below.
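
As a hedged illustration of the primary unroll factor, the sketch below computes it from the index offsets used in the loop and then unrolls the A[I] = A[I-2]/B[I] recurrence by three with rotating temporaries, so that no value has to be shuttled from one temporary to another. The names `primary_unroll_factor`, `recurrence_reference` and `recurrence_unrolled_by_three` are hypothetical, and the trailing compensation loop is an assumption added so the sketch handles any trip count.

```python
def primary_unroll_factor(offsets):
    """Maximum distance between the index offsets, plus one."""
    return max(offsets) - min(offsets) + 1


def recurrence_reference(a, b, n):
    a = list(a)
    for i in range(2, n):
        a[i] = a[i - 2] / b[i]
    return a


def recurrence_unrolled_by_three(a, b, n):
    a = list(a)
    t, t1 = a[0], a[1]            # t holds A[I-2], t1 holds A[I-1]
    i = 2
    while i + 2 < n:
        t2 = t / b[i]             # A[I]   = A[I-2] / B[I]
        a[i] = t2
        t = t1 / b[i + 1]         # A[I+1] = A[I-1] / B[I+1]
        a[i + 1] = t
        t1 = t2 / b[i + 2]        # A[I+2] = A[I]   / B[I+2]
        a[i + 2] = t1
        i += 3                    # t and t1 again hold A[I-2] and A[I-1]
    while i < n:                  # compensation loop for leftover iterations
        value = t / b[i]
        a[i] = value
        t, t1 = t1, value
        i += 1
    return a
```

For the loop discussed above the offsets are 0 (for A[I]) and -2 (for A[I-2]), so `primary_unroll_factor([0, -2])` gives three.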

Determining the unroll factor.

Classically the prior art uses a standard unroll factor for
all loops. Typically the number used is four. In the invention,
the unroll factor is calculated for each loop depending on
various factors. At one extreme some loops are not unrolled
at all, and other loops are unrolled eight or even more times.
The disadvantages of picking too large an unroll factor are:

1. The loop instructions may not all fit into the instruction
cache, leading to many I-cache misses.

2. The higher the unroll factor, the higher the memory
stride of memory instructions across iterations. If the
memory stride exceeds cache line size, the effectiveness
of prefetch decreases.

3. The resulting code is longer. Usually an upper bound
must be chosen; an unroll count of 1000 is not likely to
be a good idea since the compile time can go up
significantly. Also, excessive unrolling can adversely
affect other optimizations which have bounds on the
number of transformations they can make.

On the other hand, if a small unroll factor is chosen, the
following problems can occur:

1. Much more time is spent executing the high latency 
branch instruction which closes the loop. 



2. The short inner loop provides many fewer opportunities
for optimization than longer inner loops. Where the
inner loop has high latency instructions, the compiler
can often have them execute in parallel with low
latency instructions. This may not be possible in
very short loops.

One must keep in mind that the compiler is compiling
loops that range from single instruction inner loops to loops
that have scores or even hundreds of instructions, and so the
compiler must balance these considerations to
achieve good unroll factors. To determine the unroll factor,
the compiler considers the following in decreasing order of
importance:

1. There is a maximum value of the unroll factor (which 
in the preferred embodiment of the invention is eight). 

2. The number of instructions in the unrolled loop must 
not exceed a specific limit. This provides another upper 
bound to the unroll factor. 

3. If there are references to previous indexed contents of
the array such as was shown in FIGS. 10 through 12, an
unroll factor suggested by this analysis (or a multiple of
it) should be used.

4. If prefetch instructions are being generated (this is 
known based on a user defined flag), then try to pick an 
unroll factor that keeps the value of the strides of array 
references within the loop below the cache line size. 

5. If the trip count is a constant known at compile time,
then an unroll factor that eliminates the need for a
compensation code loop should be selected. Typically
this would be an unroll factor of 2, 4 or 8, although
other numbers such as 3 or 5 might be possible.

6. If there is profile information, use that. If the profile
information says that the loop iterates on average
k times, then if k is smaller than the maximum value of the
unroll factor as dictated by the previous steps, use k;
else use the maximum value of the unroll factor.

7. If there are high latency operations within the loop such 
as divide and square root operations, use an unroll 
factor that will enhance the maximum overlap of these 
instructions. For instance, if the architecture has two 
divide units and the loop has a single divide instruction, 
the loop should be unrolled an even number of times so 
that both the divide units can be kept simultaneously 
busy. 
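
One way to picture how these seven considerations might be combined is the Python sketch below. It is only an illustration: the `LoopInfo` record, its field names, the default limits (64 instructions, 32-byte cache line, two divide units) and the lexicographic scoring are all assumptions made here, not the algorithm of the preferred embodiment. Considerations 1 and 2 are treated as hard upper bounds; considerations 3 through 7 are ranked in the stated order of importance, and the largest factor wins among otherwise equal candidates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopInfo:
    body_instructions: int                    # instructions in the original loop body
    recurrence_factor: Optional[int] = None   # e.g. 3 for A[I] = A[I-2]/B[I]
    stride_bytes: Optional[int] = None        # per-iteration stride, if prefetch is on
    trip_count: Optional[int] = None          # constant trip count, if known
    profile_avg_trips: Optional[float] = None
    divide_ops: int = 0                       # high latency operations in the body


def choose_unroll_factor(loop, max_factor=8, max_instructions=64,
                         cache_line=32, divide_units=2):
    # Considerations 1 and 2: hard upper bounds on the unroll factor.
    cap = min(max_factor, max(1, max_instructions // loop.body_instructions))

    def score(u):
        return (
            loop.recurrence_factor is None or u % loop.recurrence_factor == 0,  # 3
            loop.stride_bytes is None or u * loop.stride_bytes <= cache_line,   # 4
            loop.trip_count is None or loop.trip_count % u == 0,                # 5
            loop.profile_avg_trips is None or u <= loop.profile_avg_trips,      # 6
            loop.divide_ops == 0 or (u * loop.divide_ops) % divide_units == 0,  # 7
            u,                                  # otherwise prefer the larger factor
        )

    return max(range(1, cap + 1), key=score)
```

With these assumed defaults, a four-instruction loop with a recurrence factor of three would be unrolled six times, and the same loop with a known trip count of nine would be unrolled three times so that no compensation loop is needed.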

The algorithm that computes the unroll factor tries to
compute an optimal and acceptable unroll factor. The cost of
a nonoptimal unroll factor is slower run time code. As
discussed above, the algorithm is sensitive to profile data,
number of instructions in the loop, architecture features like
functional units and cache line size, interactions with other
optimizations and constant trip counts.

Attention is directed to FIG. 13, which shows how the
optimization algorithms presented here are implemented. At
201 the unroll factor is determined as described above. Next,
at 203, a check is made for the special case where the trip
count is known at compile time and is a multiple of the
unroll factor. In this case, the unrolled loop code is generated
from the original loop code, any middle exits are removed
leaving only the final exit, and the unrolled code is output at
242. Because this is an unrolled loop which needs no
compensation code, none is output. The other exit occurs at
203 where the trip count is not known at compile time, or the
trip count and unroll factor are such that compensation code
must be generated. Control goes to 205 where the unrolled
code is generated. For non-WHILE loops, the middle exits
are removed leaving only the final exit. For a WHILE loop,




the final exit is moved to the end of the loop (226) instead
of the middle and all other exits removed. At this time a
determination (at 229) is made using the unroll factor to
determine if the compensation code should be output before
or after the unrolled loop code. If it should be after, control
goes to 244, otherwise control goes to 246. All of these three
control paths then meet at 999, terminating the unrolling
optimization.
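
The control flow just described can be sketched as a small Python function that returns the sequence of flowchart steps taken. The function name, its parameters and the step descriptions are hypothetical; the placement test assumes the rule stated for the preferred embodiment, namely that compensation code goes in front when the unroll factor is a power of two.

```python
def plan_unrolling(trip_count, unroll_factor, is_while_loop):
    """Return the FIG. 13 steps for one loop; trip_count is None if unknown."""
    # 203: special case, no compensation code needed.
    if trip_count is not None and trip_count % unroll_factor == 0:
        return ["222: unroll without compensation, remove middle exits",
                "242: output unrolled code",
                "999: done"]

    steps = ["205: unroll with compensation allowed, remove middle exits"]
    if is_while_loop:
        steps.append("226: move final exit to end of loop")

    # 229: placement decision based on the unroll factor (assumed rule).
    is_power_of_two = unroll_factor > 0 and unroll_factor & (unroll_factor - 1) == 0
    if is_power_of_two:
        steps.append("246: output compensation code, then unrolled code")
    else:
        steps.append("244: output unrolled code, then compensation code")
    steps.append("999: done")
    return steps
```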

FIG. 14 is a block schematic diagram of a compiler for a
programmable machine in accordance with the invention.
The compiler of FIG. 2b shows a loop unrolling module 30.
The preferred embodiment of the invention provides a loop
unrolling module that is placed within the compiler as
shown in FIG. 2b. As shown in FIG. 14, the compiler
comprises an analysis module 1301 for analyzing and
unrolling loops within source applications. An optimizer
module 1302 determines an optimum unroll factor in
response to the analysis module. An unroll module 1303
generates an unrolled loop having said optimum unroll
factor, while a compensation module 1304 generates and
places any compensation code as required as a result of loop
unroll optimization.

Although the invention is described herein with reference
to the preferred embodiment, one skilled in the art will
readily appreciate that other applications may be substituted
for those set forth herein without departing from the spirit
and scope of the present invention. Accordingly, the invention
should only be limited by the claims included below.

We claim: 

1. In a programmable machine, a compiler comprising:
an analysis module for analyzing and unrolling loops
within source applications;
an optimizer module for determining an optimum unroll
factor in response to said analysis module;
an unroll module for generating an unrolled loop having
said optimum unroll factor; and
a compensation module for generating and placing any
compensation code as required as a result of loop unroll
optimization,
wherein said compensation module performs a placement
calculation to determine whether to put said compensation
code before the unrolled loop or after the
unrolled loop.

2. The compiler of claim 1, wherein said compensation
module ensures that said compensation code is executed at
least once when said unrolled loop is executed.

3. The compiler of claim 1, wherein said unroll factor is
responsive to a number of instructions in the loop.



4. The compiler of claim 1, wherein said unroll factor is
responsive to resource usage within the loop.

5. The compiler of claim 1, wherein said unroll factor is
responsive to prefetch distance of memory references within
the loop.

6. The compiler of claim 1, wherein said unroll factor is
responsive to profile information collected from previous
executions of compiled code.

7. The compiler of claim 1, wherein said unroll factor is
responsive to recurrence of memory references within the
loop.

8. The compiler of claim 1, wherein said unroll factor is
responsive to the number of instructions in the loop.

9. The compiler of claim 1, wherein a trip count for the
loop is known at compile time, and wherein said optimizer
module determines an unroll factor that executes the compensation
code zero times and suppresses the generation of
said compensation code.

10. The compiler of claim 1, wherein said placement
calculation is responsive to the unroll factor computed for
the loop.

11. The compiler of claim 1, wherein said loop is a loop
having an early exit.

12. The compiler of claim 1, wherein loop unrolling is
integrated with other low-level optimization phases.

13. The compiler of claim 12, wherein said other low-level
optimization phases include any of prefetch instruction
insertion, register reassociation, and instruction scheduling.

14. A method for unrolling loops, comprising the steps of:
determining an unroll factor;
generating an unrolled loop which always exits leaving a
remaining trip count of at least one;
generating compensation code;
determining whether the compensation code should be
placed before the unrolled loop or after the unrolled
loop; and
determining if the trip count is a power of two, and if it
is, placing the compensation code before the unrolled
loop.

15. The method of claim 14, wherein the loop to be
unrolled is a loop having an early exit.

16. The method of claim 15, wherein the loop having an
early exit after unrolling is transformed into a loop having an
exit at its end, and from which all intermediate exits have
been removed.

* * * * *






United States Patent [19]
Mahadevan et al.

US005797013A

[11] Patent Number: 5,797,013
[45] Date of Patent: Aug. 18, 1998



[54] INTELLIGENT LOOP UNROLLING 

[75] Inventors: Uma Mahadevan, Sunnyvale; Lacky
Shah, Fremont, both of Calif.

[73] Assignee: Hewlett-Packard Company, Palo Alto,
Calif.

[21] Appl. No.: 564,514

[22] Filed: Nov. 29, 1995

[51] Int. Cl.6 G06F 9/45

[52] U.S. Cl. 395/709; 395/588

[58] Field of Search 395/705, 709,
395/588, 580

[56] References Cited

U.S. PATENT DOCUMENTS

5,265,253 11/1993 Yamada 395/700
5,367,651 11/1994 Smith et al. 395/700
5,386,562 1/1995 Jain et al. 395/650

OTHER PUBLICATIONS

"A Comparative Evaluation of Software Techniques to Hide
Memory Latency", John et al., Proc. of the 28th Ann. Hawaii
Int'l Conf., 1995, pp. 229-238.

"Schedule driven Loop Unrolling for Parallel Processors",
System Sciences, 1991 Annual Hawaii Int'l Conference,
1991, vol n pp. 458-467.

"Aggressive Loop Unrolling in a Retargetable, Optimizing
Compiler", Davidson et al., Dept. of Comp. Science, Univ. of
Va., pp. 1-14.

"Unrolling Loops in Fortran", Dongarra et al., Soft. Practice
and Experience, vol. 9, 1979, pp. 219-226.

Hendren et al., "Designing Programming Languages for the
Analyzability of Pointer Data Structures," Comput. Lang.,
vol. 19, No. 2, pp. 119-134 (1993).

Weiss et al., "A Study of Scalar Compilation Techniques for
Pipelined Supercomputers," ACM, pp. 105-109 (1987).

Primary Examiner: Emanuel Todd Voeltz
Assistant Examiner: Kakali Chaki



[57] ABSTRACT

A compiler facilitates efficient unrolling of loops and
enables the elimination of extra branches from the loops,
including the elimination of conditional branches from
unrolled loops with early exits. Unrolling also enhances
other optimizations, such as prefetch, scalar replacement,
and instruction scheduling. The unroll factor is calculated to
determine the amount of loop expansion and the optimum
location to place compensation code to complete the original
loop count, i.e. before or after the unrolled loop. The
compiler is applicable, for example, to modern RISC
architectures, where the latency of memory references and
branches is higher than that of integer and floating point
arithmetic instructions.

16 Claims, 13 Drawing Sheets 



[Front-page drawing, corresponding to Fig. 14: Loops, Analysis Module 1301, Optimizer Module 1302, Unroll Module 1303, Compensation Module 1304, Unrolled Loops]

U.S. Patent Aug. 18, 1998 Sheet 1 of 13 5,797,013

Fig. 1 (PRIOR ART)
[Block diagram; labels surviving extraction: system bus 15, MEMORY 13, I/O devices 14]



U.S. Patent Aug. 18, 1998 Sheet 2 of 13 5,797,013

U.S. Patent Aug. 18, 1998 Sheet 3 of 13 5,797,013

Fig. 2b (PRIOR ART)
[Block diagram of the low level optimizer (LLO) 24; labels surviving extraction: LLIR 120, global optimization, prefetch analysis and code generation, instruction scheduling + register allocation]



U.S. Patent Aug. 18, 1998 Sheet 4 of 13 5,797,013

    I = N
BEGIN_LOOP:
    A[I] = B[I] + X                             150
    I = I - 1
    IF (I > 0) GOTO BEGIN_LOOP;
END_LOOP:
    PRINT I

Fig. 3 (PRIOR ART)

    I = N
BEGIN_LOOP:
    A[I] = B[I] + K                             401
    I = I - 1
    IF (I < 0) GOTO END_LOOP                    403
    C[I] = A[I] + X                             405
    GOTO BEGIN_LOOP
END_LOOP:
    PRINT I

Fig. 7a (PRIOR ART)

BEGIN_LOOP:
    A[I] = A[I-2]/B[I]                          367
    I = I + 1;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 9 (PRIOR ART)



U.S. Patent Aug. 18, 1998 Sheet 5 of 13 5,797,013

    I = N
    IF (I < 4) GOTO BEGIN_COMPENSATION_LOOP     301
UNROLLED_LOOP:
    A[I] = B[I] + X                             333
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 4) GOTO UNROLLED_LOOP              159
END_UNROLLED_LOOP:
    IF (I = 0) GOTO END_COMPENSATION_LOOP       303
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + X                             155
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP;
END_COMPENSATION_LOOP:
    PRINT I

Fig. 4 (PRIOR ART)



U.S. Patent Aug. 18, 1998 Sheet 6 of 13 5,797,013

    I = N
    IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 5) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
COMPENSATION_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP;
END_COMPENSATION_LOOP:
    PRINT I

Fig. 5
(Reference numerals in the drawing: 311, 321, 323.)



U.S. Patent Aug. 18, 1998 Sheet 7 of 13 5,797,013

    J = ((N-1) % 4) + 1        NOTE: ((N-1) MOD 4) + 1
    I = N - J                  NOTE THAT I IS GUARANTEED
                               TO BE DIVISIBLE BY 4 HERE.
COMPENSATION_LOOP:
    A[J] = B[J] + X
    IF (J > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    IF (I = 0) GOTO END_UNROLLED_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
    PRINT I

Fig. 6



U.S. Patent Aug. 18, 1998 Sheet 8 of 13 5,797,013

    I = N
BEGIN_LOOP:
UNROLLED_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X

    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X

    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X

    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;
    C[I] = A[I] + X
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;      365
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_LOOP:
    PRINT I

Fig. 7b (PRIOR ART)



U.S. Patent Aug. 18, 1998 Sheet 9 of 13 5,797,013

    I = N
    IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X

    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X

    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X

    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X

    IF (I >= 5) GOTO UNROLLED_LOOP;
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP;      365
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_LOOP:
    PRINT I

Fig. 8
(Other reference numerals in the drawing: 311, 312, 319, 361, 363.)



U.S. Patent Aug. 18, 1998 Sheet 10 of 13 5,797,013

    T = A[I-2]                                  388
    T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T/B[I]                               391
    T = T1                                      393
    T1 = A[I]                                   395
    I = I + 1;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 10 (PRIOR ART)

    T = A[I-2]                                  388
    T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T/B[I]                               391
    T = T1                                      393
    T1 = A[I]                                   395
    I = I + 1;

    A[I] = T/B[I]
    T = T1
    T1 = A[I]
    I = I + 1;

    A[I] = T/B[I]
    T = T1
    T1 = A[I]
    I = I + 1;

    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 11 (PRIOR ART)



U.S. Patent Aug. 18, 1998 Sheet 11 of 13 5,797,013

    T = A[I-2]
    T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T/B[I]
    T2 = A[I]

    A[I+1] = T/B[I+1]
    T = A[I+1]

    A[I+2] = T2/B[I+2]                          403
    T1 = A[I+2]

    I = I + 3;
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 12
(Other reference numerals in the drawing: 400, 401, 402, 410.)



U.S. Patent Aug. 18, 1998 Sheet 12 of 13 5,797,013

201   DETERMINE TRIPCOUNT IF KNOWN
      DETERMINE UNROLL FACTOR FROM LOOP LENGTH, PREFETCH
      SPAN, PROFILE, RESOURCE USAGE

203   IF TRIPCOUNT KNOWN AND TRIPCOUNT MOD UNROLLFACTOR = 0
      YES: go to 222    NO: go to 205

222   ULCODE = UNROLL(LOOP, UNROLLFACTOR, NOCOMP)
      REMOVEMIDDLEEXITS(ULCODE) EXCEPT THE FINAL EXIT

242   OUTPUT(ULCODE)

205   ULCODE = UNROLL(LOOP, UNROLLFACTOR, COMPOK)
      REMOVEMIDDLEEXITS(ULCODE)

226   ULCODE = MOVE FINAL EXIT TO END OF LOOP(ULCODE)

229   IF PLACECOMP(UNROLL FACTOR) = BEFORE
      NO: go to 244    YES: go to 246

244   OUTPUT ULCODE
      OUTPUT COMPCODE FOR LOOP

246   OUTPUT COMPCODE FOR LOOP
      OUTPUT ULCODE

999   DONE

Fig. 13



U.S. Patent Aug. 18, 1998 Sheet 13 of 13 5,797,013

Fig. 14
[Block diagram: within low level optimizer 24, loop unrolling module 30 takes Loops through Analysis Module 1301, Optimizer Module 1302, Unroll Module 1303 and Compensation Module 1304 to produce Unrolled Loops]

INTELLIGENT LOOP UNROLLING 

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention relates to compilers. More particularly, the 
invention relates to techniques for unrolling computation 
loops in a compiler so as to generate code which executes 
faster. 

2. Description of The Prior Art 

FIG. 1 is a block schematic diagram of a uniprocessor 
computer architecture, including a processor cache. In the 
figure, a processor 11 includes a cache 12 which is in 
communication with a system bus 15. A system memory 13 
and one or more I/O devices 14 are also in communication 
with the system bus. 

FIG. 2a is a block schematic diagram of a software 
compiler 20, for example as may be used in connection with 
the computer architecture shown in FIG. 1. The compiler 
Front End component 21 reads a source code file (100) and 
translates it into a high level intermediate representation 
(110). A high level optimizer 22 optimizes the high level 
intermediate representation 110 into a more efficient form. A 
code generator 23 translates the optimized high level inter- 
mediate representation to a low level intermediate represen- 
tation (120). The low-level optimizer 24 converts the low 
level intermediate representation (120) into a more efficient 
(machine-executable) form. Finally, an object file generator 
25 writes out the optimized low-level intermediate representation
into an object file (141). The object file (141) is
processed along with other object files (140) by a linker 26 
to produce an executable file (150), which can be run on the 
computer 10. In the invention described herein, it is assumed 
that the executable file (150) can be instrumented by the 
compiler (20) and linker (26) so that when it is run on the 
computer 10, an execution profile (160) may be generated 
which can then be used by the low level optimizer 24 to 
better optimize the low-level intermediate representation 
(120). The compiler 20 is discussed in greater detail below. 

The compiler is the piece of software that translates
source code, such as C, BASIC, or FORTRAN, into a binary 
image that actually runs on a machine. Typically the com- 
piler consists of multiple distinct phases, as discussed above 
in connection with FIG. 2a. One phase is referred to as the 
front end, and is responsible for checking the syntactic 
correctness of the source code. If the compiler is a C 
compiler, it is necessary to make sure that the code is legal 
C code. There is also a code generation phase, and the 
interface between the front-end and the code generator is a 
high level intermediate representation. The high level inter- 
mediate representation is a more refined series of instruc- 
tions that need to be carried out. For instance, a loop might 
be coded at the source level as: 

for (I=0, I<10, I=I+1),

which might in fact be broken down into a series of steps,
e.g. each time through the loop, first load up I and check it
against 10 to decide whether to execute the next iteration. 

A code generator takes this high level intermediate representation
and transforms it into a low level intermediate
representation. This is closer to the actual instructions that
the computer understands. An optimizer component of a
compiler must preserve the program semantics (i.e. the
meaning of the instructions that are translated from source
code to a high level intermediate representation, and thence
to a low level intermediate representation and ultimately an
executable file), but rewrites or transforms the code in a way
that allows the computer to execute an equivalent set of
instructions in less time.

Modern compilers are structured with a high level optimizer
(HLO) that typically operates on a high level intermediate
representation and substitutes in its place a more
efficient high level intermediate representation of a particular
program that is typically shorter. For example, an HLO
might eliminate redundant computations. With the low level
optimizer (LLO), the objectives are the same as the HLO,
except that the LLO operates on a representation of the
program that is much closer to what the machine actually
understands.

FIG. 2b is a block diagram showing a low level optimizer
for a compiler, including a loop unrolling component 30
according to the invention. The low level optimizer 24 may
include any combination of known optimization techniques,
such as those that provide for local optimization 35, global
optimization 36, loop identification 37, loop invariant code
motion 38, prefetch 34, register reassociation 31, and instruction
scheduling 32.

Source programs translated into machine code by compilers
consist of loops, e.g. DO loops, FOR loops, and
WHILE loops. Optimizing the compilation of such loops
can have a major effect on the run time performance of the
program generated by the compiler. In some cases, a significant
amount of time is spent doing such bookkeeping
functions as loop iteration and branching, as opposed to the
computations that are performed within the loop itself.
These loops often implement scientific applications that
manipulate large arrays and data instructions, and run on
high speed processors.

This is particularly true on modern processors, such as
RISC architecture machines. The design of these processors
is such that in general the arithmetic operations operate a lot
faster than memory fetch operations. This mismatch
between processor and memory speed is a very significant
factor in limiting the performance of microprocessors. Also,
branch instructions, both conditional and unconditional,
have an increasing effect on the performance of programs.
This is because most modern architectures are super-pipelined
and have some sort of a branch prediction algorithm
implemented. The aggressive pipelining makes the
branch misprediction penalty very high. Arithmetic instructions
are interregister instructions that can execute quickly,
while the branch instructions, because of mispredictions,
and memory instructions such as loads and stores, because
of slower memory speeds, can take a longer time to execute.
Modern compilers perform code optimization. Code optimization
consists of several operations that improve the
speed and size of the compiled code, while maintaining
semantic equivalence. Common optimizations include:
prefetching data so that they are available in cache
memory when needed;
detecting calculations that compute constants and performing
the calculation at compile time;
scalar replacement, which keeps the value of a variable in
a register within the loop;
moving calculations outside of loops where possible; and
performing code scheduling, which consists of rearranging
the order of and modifying instructions to achieve
faster running but semantically equivalent code.
Many modern compilers also employ an optimizing technique
known as loop unrolling to generate faster running
code. In its essence, loop unrolling takes the inner loop, i.e.
the code between the beginning and the end of the loop, and
repeats it in the inner loop some number of times, e.g. four
times. It then executes the unrolled loop one-fourth as many
times as it would have executed the original loop. The
number of times the loop is replicated within the unrolled
loop is called the unroll factor. Because the number of times
the original loop is executed is not always divisible by the
unroll factor, a compensation loop often has to be
generated to execute the remaining iterations of the
original loop that are not executed by the unrolled loop.

As discussed above, such loops as DO, FOR and WHILE 
loops are common in programs, especially in scientific and 
other time-consuming programs. Frequently 80% of the 
running time of a program can be in a few small loops. As 
a result anything that can speed up such loops is of great 
value in making a more efficient compiler. 

Consider the simple loop shown on FIG. 3. The three
instructions in the inner loop 150 are executed N times.
According to the prior art, this loop can be unrolled with an
unroll factor of four to produce the code shown on FIG. 4,
where the inner loop (less the exit condition) is replicated 4
times 333. This loop is also followed by a test 303 to see if
the full original loop has been completed, and a compensation
loop 155 which is executed to complete the original
loop trip count if it has not been completed.

An inspection of the loop shows that it is semantically 
equivalent to the loop of FIG. 3 because the same function 
is performed and the same result is achieved. However, the 
loop of FIG. 4 runs much faster for several reasons. First, the 
conditional branch 159 which exits the loop is executed only 
once for each time through the unrolled loop rather than 
once for each time through the original loop. Assuming an 
unroll factor of four as is shown here, this saves 3/4 of a
conditional branch per original loop iteration. More 
importantly, other compiler optimizations interact with loop 
unrolling and are able to do a much better job of optimizing 
the unrolled loop, such as that identified by numeric desig- 
nator 333, as compared to the original loop, identified by 
numeric designator 150. In an unrolled loop, there are more 
operations that could be scheduled in parallel, more oppor- 
tunity to do scalar replacement and other optimizations, and 
more possibilities to do prefetching. 

The tests identified by numeric designators 301 and 303 
are also of interest. These are the conditional branches which 
have a higher probability of being mispredicted. Anything 
that can be done to eliminate one of them will be very useful. 

Prior art loop unrolling techniques have certain disadvantages.
For example, J. J. Dongarra and A. R. Hinds, Unrolling
Loops in FORTRAN, describes how one can unroll loops
manually by duplicating code. This is an early solution to
optimizing code that is not even implemented by the
compiler.

S. Weiss and J. E. Smith, A Study of Scalar Compilation
Techniques for Pipeline Supercomputers, discuss unrolling in
the compiler. Here the authors address only the simple situations:

a) Cases where the loop count is known at compile time. 
They do not address loop unrolling when the loop count 
is only known at run time. 

b) Cases where the loop exit appears only at the beginning 
or end of the loop. They do not address the situation of 
unrolling loops with early exits (loops whose exit may 
occur in the middle of the loop). 

The authors do not address the following issues: 

a) Determining whether to place the compensation code 
before or after the unrolled loop. 

b) Tuning of the iteration count to reduce branch mispre- 
diction. 

c) Factors that affect the unroll factor. 




L. J. Hendren and G. R. Gao, Designing Programming
Languages for the Analyzability of Pointer Data Structures,
address the issue of unrolling loops as part of compiler
optimization. They do not, however, discuss:

a) Unrolling loops with early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor;

d) Whether to place compensation code at the beginning
or end of the loops; and

e) Compiling loops whose trip count is only known at run
time.

J. Davidson and S. Jinturkar, Aggressive Loop Unrolling in a
Retargetable Optimizing Compiler, Dept. of Computer
Science, Thornton Hall, University of Virginia, disclose a
code transformation, referred to as aggressive loop
unrolling, in a retargetable optimizing compiler where the
loop bounds are not known at compile time. Various factors
were analyzed to determine how and when loop unrolling
should be applied, resulting in an algorithm for loop unrolling
in which execution-time counting loops (i.e. a counting
loop whose iteration count is not trivially known at compile
time) are unrolled and loops having complex control flow
are unrolled. However, they do not discuss:

a) Unrolling loops having early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor; and

d) Whether to place compensation code at the beginning
or end of the loops.

Another part of the loop unrolling prior art is shown on
FIG. 7, which illustrates a loop having an early exit (also
referred to as a WHILE loop), consisting of an exit test and
branch 403 in the middle of the loop between computations
401 and 405. According to the prior art, the code, including
the exit 403, is replicated four times in an unrolled loop. The
number of branches in the loop is, however, not reduced
with this optimization.

While some of the techniques discussed in the prior art are
applicable to compilers for all computers, only some of them
are particularly applicable for modern RISC computers,
where branch instructions form a much bigger bottleneck than
in earlier technologies. Also, compilers for RISC architectures
are a lot more aggressive, and the interactions of
various optimizations play a key role in the quality of the
final code.

SUMMARY OF THE INVENTION 

The invention provides a new compiler that can unroll
more loops than previous algorithms. It also significantly
reduces the number of branch instructions by cleverly handling
the iteration count and by converting loops with early
exits to regular FOR loops. The invention also provides for
computing the unroll factor and the placement of the
compensation loop by taking a lot of other optimizations into
consideration.
The compiler:

Eliminates time-consuming conditional branch instructions
from the compensation code loop by changing the
conditional exit of the main unrolled loop so that it always
exits with at least one iteration which has yet to be
executed by the compensation code. This eliminates the
need to test for zero remaining iterations.

Determines whether it is better to place the compensation 
code at the beginning or the end of the unrolled loop 



01/29/2004, EAST Version: 1.4.1 



5,797,013



according to which one would likely provide the better 
optimization. Generally, it prefers to put the compen- 
sation loop in front of the main loop if the unroll factor 
is a power of two and after the main loop if the unroll 
factor is not a power of two. 

Computes the unroll factor by taking into account the 
interactions of other optimizations like prefetch, scalar 
replacement and register allocation, and also taking 
into account hardware features like number of func- 
tional units. Unrolling loops over-aggressively or 
under-aggressively can inhibit other optimizations or 
make them less effective. 

Converts loops with early exit to loops with exit at the end 
to apply more efficient optimizations to the loop. It does 
this by ensuring that the compensation code is always 
executed at least once, enabling the compiler to elimi- 
nate the exit tests from the unrolled loop. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a block schematic diagram of a uniprocessor 
computer architecture including a processor cache; 

FIG. 2a shows a block schematic diagram of a modern 
software compiler; 

FIG. 2b shows a schematic diagram of the low level
optimizer;

FIG. 3 shows a simple program loop; 

FIG. 4 shows the loop of FIG. 3 that has been unrolled 
four times according to the prior art; 

FIG. 5 shows the loop of FIG. 3 unrolled according to the 
invention and having an eliminated branch; 

FIG. 6 shows an unrolled loop having pre-loop compen- 
sation code; 

FIG. 7a shows a simple WHILE loop;

FIG. 7b shows the WHILE loop of FIG. 7a that has been
unrolled according to the prior art;

FIG. 8 shows the WHILE loop of FIG. 7a that has been
unrolled according to the invention;

FIG. 9 shows a schematic representation of a simple loop; 

FIG. 10 shows the loop of FIG. 9 with scalar replacement 
according to the prior art; 

FIG. 11 shows the loop of FIG. 10 unrolled; 

FIG. 12 shows a schematic representation of the loop of 
FIG. 11 after copy elimination; 

FIG. 13 shows a schematic representation of the compiler 
logic which determines the unroll factor, compensation code 
placement and other optimizations; and 

FIG. 14 is a block schematic diagram of a compiler for a 
programmable machine in accordance with the invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

The invention provides a new compiler that features smart 
unrolling of loops. 

The invention provides a prefetch driver 34 that operates 
in concert with such known techniques. The following 
discussion pertains to the various elements of the low level 
optimizer shown on FIG. 2b: 

Local optimizations include code improving transformations
that are applied on a basic block by basic block basis.
For purposes of the discussion herein, a basic block
corresponds to the longest contiguous sequence of machine
instructions without any incoming or outgoing control
transfers, excluding function calls. Examples of local
optimizations include local common sub-expression elimination
(CSE), local redundant load elimination, and peephole
optimization.

Global optimizations include code improving transformations
that are applied based on analysis that spans across
basic block boundaries. Examples include global common
sub-expression elimination, loop invariant code motion,
dead code elimination, register allocation and instruction
scheduling.

Loop invariant code motion is the identification of
instructions located within a loop that compute the same result
on every loop iteration and the re-positioning of such
instructions outside the loop body.
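As a small illustration of loop invariant code motion, the sketch below applies the transformation by hand. The functions and the computation are hypothetical; a compiler would perform the same move on machine instructions rather than on source code.

```python
import math


def scale_naive(values, angle):
    """The factor is recomputed on every iteration although it never changes."""
    out = []
    for v in values:
        factor = math.cos(angle) * 2.0   # loop invariant: same result each time
        out.append(v * factor)
    return out


def scale_hoisted(values, angle):
    """The invariant computation has been moved outside the loop body."""
    factor = math.cos(angle) * 2.0       # computed once, before the loop
    out = []
    for v in values:
        out.append(v * factor)
    return out
```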

Register allocation and instruction scheduling is the process
of assigning hardware registers to symbolic instruction
operands and the re-ordering of instructions to minimize
run-time pipeline stalls where the processor must wait on a
memory fetch from main memory or wait for the completion
of certain complicated instructions that take multiple cycles
to execute (e.g. divide, square root instructions).

One important phase of the compiler identifies loops and 
access patterns to estimate how many cycles are devoted to 
loop iterations. In the invention, the compiler translates the 
higher level application code into an instruction stream that 
the processor executes, and in the process of this translation 
the compiler unrolls loops. 

The longer unrolled loops allow the compiler to provide
several advantages, such as:

1) It eliminates the extra branch exits. This saves CPU
cycles by not having to execute the branch instructions
and also helps reduce branch misprediction. Why this is
important is evident if one considers most modern
RISC architectures. These architectures have a long
pipeline that is fed by an instruction fetch mechanism.
When the fetch mechanism encounters a branch, it tries
to predict if the branch is going to be taken or not. It
then fetches instructions based on this prediction. The
prediction is necessary to keep the pipeline from stalling.
If the architecture's prediction is correct (this is
determined when the branch instruction completes
execution, which is a few cycles after it has been
fetched), then everything works fine; otherwise all the
instructions that have been fetched after the branch are
discarded and new instructions are fetched based on the
correct outcome of the branch. This penalty of discarding
fetched instructions and fetching new ones, when
the branch is mispredicted, is known as the branch
misprediction penalty and it is very significant for most
modern architectures. It is of the order of 5-10 cycles
per branch instruction that is mispredicted. By reducing
the number of branch instructions, the number of
branches that get mispredicted is automatically reduced.

2) It can better insert prefetches and effect other
optimizations into the longer inner loop code. When the loop
is unrolled, there are more memory instructions in the
loop and also the memory stride (the distance between
the memory accesses of an instruction in two consecutive
iterations) is bigger. If a loop is unrolled four times,
the memory stride goes up by a factor of four. This helps
the prefetch to do a more effective job. When the memory
stride increases, as long as it is less than the cache line
size (which is architecture dependent), the prefetches
become more effective. When the memory stride
becomes greater than the cache line size the prefetches
can hurt. Hence the loops should be unrolled such that
the memory stride is less than the cache line size
whenever prefetch instructions are going to be generated
for the loop and whenever possible.

3) Scalar replacement/recurrence elimination inserts copies
in the loop to keep the value of a variable live in
the next iteration (see, for example, FIGS. 9, 10, and
11). These copies can be eliminated by unrolling the
loop a certain number of times.

4) Longer sequences of non-branching instructions can
achieve an overlap between instructions that have nothing
to do with memory and those that do. This is known
as instruction scheduling and is explained below. While
the access time between the processor and the cache is
typically 1 to 5 cycles, the retrieval time from main
memory to cache is often on the order of 10 to 100 cycles. When
the processor actually gets to the point where the data 
item is needed from memory, if the data is not in cache, 
it might take 100 processor cycles to fetch it from main 
memory. Where the compiler can optimize the longer 
inner loop code, it may only be necessary to wait for 20 
cycles because 80 cycles worth of look up time is 
hidden or overlapped with the execution of other 
instructions. 
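The arithmetic behind points 2 and 4 can be sketched as follows. The helper names and the 64-byte cache line used in the usage note are assumptions made for illustration; the 100-cycle and 80-cycle figures are the ones quoted in the text, and the default cap of 8 is an arbitrary illustrative bound.

```python
def unrolled_stride_bytes(base_stride_bytes, unroll_factor):
    """Stride of one memory instruction between consecutive unrolled iterations."""
    return base_stride_bytes * unroll_factor


def largest_factor_within_line(base_stride_bytes, cache_line_bytes, max_factor=8):
    """Largest unroll factor whose stride stays below the cache line size.

    Returns 1 (no unrolling) when no larger factor keeps the stride
    below the line size.
    """
    best = 1
    for factor in range(2, max_factor + 1):
        if unrolled_stride_bytes(base_stride_bytes, factor) < cache_line_bytes:
            best = factor
    return best


def exposed_miss_cycles(miss_latency_cycles, overlapped_cycles):
    """Cycles the processor still waits after overlapping other instructions."""
    return max(0, miss_latency_cycles - overlapped_cycles)
```

For 8-byte array elements and an assumed 64-byte line, largest_factor_within_line(8, 64) yields 7, since a factor of 8 would make the stride equal to the line size. exposed_miss_cycles(100, 80) yields the 20 cycles mentioned above.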

Loop unrolling is integrated with other low level optimization
phases, such as the prefetch insertion algorithm,
register reassociation, and instruction scheduling. The new
compiler yields significant performance improvements for
some industry-standard performance benchmarks, for
example on the SPEC92 and SPEC95 benchmarks on the
Hewlett-Packard Company (Palo Alto, Calif.) PA-8000
processor.

The following discussion explains compiler operation in 
the context of a loop within an application program. Loops 
are readily recognized as a sequence of code that is itera- 
tively executed some number of times. The sequence of such 
operations is predictable because the same set of operations 
is repeated for each iteration of the loop. It is common 
practice in an application program to maintain an index 
variable for each loop that is provided with an initial value, 
and that is incremented by a constant amount for each loop 
iteration until the index variable reaches a final value. The 
index variable is often used to address elements of arrays 
that correspond to a regular sequence of memory locations. 

In the compiler, it has been found that the low level 
optimizer component of a compiler is in a good position to 
deduce the number of cycles required by a stretch of code 
that is repetitively executed and this information can be used 
to determine the optimal unroll factor. As discussed above, 
the concept of loop unrolling is not new, but use of smart 
unrolling is new. For example, FIG. 4 shows the loop of FIG. 
3 after the loop has been unrolled four times. Thus, if N
were 100, the unrolled loop body is executed 25 times instead
of the original loop body being executed 100 times.

FIG. 5 shows the output code that is generated by the 
invention in contrast to the code generated by the prior art, 
as shown on FIG. 4. The replicated inner loops 333 are the 
same. Also, the compensation loop 323 is the same as the 
prior art compensation loop 155 of FIG. 4. However, the
loop test at 301 and 159 of FIG. 4 now tests to exit if I>=5
rather than I>=4, as can be seen at 311 and 321 of FIG. 5.
The effect is to ensure that the compensation loop is always
executed at least once. This eliminates the need to test for the
zero case, and with it the branch instruction 303 of FIG. 4.
As indicated above, the elimination of this branch instruction
significantly increases the speed of the compiled code by
reducing the number of branch instructions that get mispredicted.
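A minimal sketch of this idea follows, under two stated assumptions: the loop body (a[i] = b[i] + c) is a hypothetical stand-in because FIG. 5 is not reproduced here, and the original loop is taken to execute at least once (n >= 1), as a bottom-tested DO loop would. The comparison is written in terms of the remaining trip count rather than the exact variable tested in the figure.

```python
def unrolled_leaving_one(b, c):
    """Unroll by 4 so that 1 to 4 iterations always remain for compensation."""
    n = len(b)
    if n < 1:
        raise ValueError("this sketch assumes a trip count of at least one")
    a = [0] * n
    i = 0
    # Entry test (like 311): enter the unrolled loop only if at least
    # unroll factor + 1 = 5 iterations remain.
    if n - i >= 5:
        while True:
            a[i] = b[i] + c
            a[i + 1] = b[i + 1] + c
            a[i + 2] = b[i + 2] + c
            a[i + 3] = b[i + 3] + c
            i += 4
            # Loop-closing test (like 321): leave while work still remains.
            if n - i < 5:
                break
    # No zero test is needed: between 1 and 4 iterations are always left,
    # so the compensation loop can be entered unconditionally.
    while True:
        a[i] = b[i] + c
        i += 1
        if i >= n:
            break
    return a
```

The design trade-off is that the compensation loop may now run up to four iterations instead of three, in exchange for removing one hard-to-predict branch.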

It is also possible to put the compensation code in front of
the main loop, as is shown on FIG. 6. Here the compensation
loop 383 is in front of the repetitive unrolled loops 347. In
the general case, putting the compensation code before the
main unrolled loop is less efficient than putting it afterwards,
because calculating the loop trip count requires a remainder
operation which involves high latency divide operations.
However, if the unroll factor is a power of two, as in this
case where the unroll factor is 4, the remainder calculation
is a simple shift operation. Because unroll factors of 2, 4, or
8 are common, the compensation code can be placed in front
of the unrolled loop for negligible cost. As a practical
matter, it is often advantageous to put the compensation loop
in front of the unrolled loop to benefit from other
optimizations such as register reassociation. When the
compensation loop is placed before the unrolled loop, the variable that
keeps track of the iteration count is not always needed after
the unrolled loop. When the compensation loop is placed
after the unrolled loop, this variable is always needed after
the unrolled loop as there is an exposed use in the
compensation loop. This exposed use can inhibit aggressive register
reassociation. In the preferred embodiment, the architecture
of the computer and the interactions with other optimizations
dictate an unroll factor, and if it is a power of two, the
compensation code is inserted in front of the unrolled loop.
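A sketch of pre-loop compensation for an unroll factor of 4 is given below. The loop body and function name are hypothetical. The remainder is computed here with a bit mask, which is one cheap divide-free way to obtain it when the unroll factor is a power of two; the text describes the cheap form as a shift.

```python
def unrolled_with_preloop_compensation(b, c):
    """Run the leftover iterations first, then the loop unrolled by 4."""
    factor = 4                       # must be a power of two for the mask trick
    n = len(b)
    a = [0] * n
    leftover = n & (factor - 1)      # same value as n % factor, without a divide
    i = 0
    # Compensation loop placed in front (like 383).
    while i < leftover:
        a[i] = b[i] + c
        i += 1
    # Main unrolled loop (like 347): the remaining count is a multiple of 4,
    # so no compensation is needed afterwards.
    while i < n:
        a[i] = b[i] + c
        a[i + 1] = b[i + 1] + c
        a[i + 2] = b[i + 2] + c
        a[i + 3] = b[i + 3] + c
        i += 4
    return a
```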
Another optimization technique that is part of the invention
herein disclosed rearranges loops with early exits
(which are henceforth referred to as WHILE loops). These
loops are characterized by the fact that some of the inner
loop code is done before the loop test, and some after the
loop test, as is shown on FIG. 7a. Here the loop has an exit
branch 403 in the middle with inner loop operations 401
before it, and other inner loop operations 405 after it.

The optimization taught in the prior art for this loop is
shown on FIG. 7b. Notice that the whole inner loop,
including the exit instruction, is replicated 377 four times.
This can be improved by converting the unrolled loop into
a FOR loop with an exit condition of (unroll factor + 1) as
opposed to unroll factor (e.g. 5 instead of 4 in this case), as
is shown on FIG. 8. This guarantees that the unrolled loop
is exited before it would have to exit due to the WHILE
condition. Because none of the branches at 377 on FIG. 7b
are taken, the WHILE exit instruction can be removed, as
is shown at 319 on FIG. 8. Thus, there is only one place that
there is a WHILE loop exit, i.e. at 3*5.
The technique herein disclosed ensures that the unrolled
loop exits before it would take any of the WHILE loop exits,
so that the WHILE test can be removed from the unrolled
loop. It is necessary to ensure that the compensation code is
always executed at least once.
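The transformation can be sketched as follows, assuming a WHILE loop whose mid-loop exit depends on a counter reaching a bound n. The two computations are modeled as trace entries "A" and "B"; these names, and the way the exit condition is expressed, are illustrative assumptions rather than the contents of FIGS. 7a and 8. The inner for statement stands in for four textual copies of the body.

```python
def while_loop(n):
    """Original loop with an exit in the middle of the body."""
    trace = []
    i = 0
    while True:
        trace.append(("A", i))       # computation before the exit (like 401)
        if i >= n:                   # early exit (like 403)
            break
        trace.append(("B", i))       # computation after the exit (like 405)
        i += 1
    return trace


def unrolled_while_loop(n, factor=4):
    """Unrolled FOR-style main loop without exits, then the original loop."""
    trace = []
    i = 0
    # Main loop: runs only while at least factor + 1 iterations remain, so
    # none of the removed mid-loop exits could have been taken.
    while n - i >= factor + 1:
        for _ in range(factor):      # represents 'factor' replicated bodies
            trace.append(("A", i))
            trace.append(("B", i))
            i += 1
    # Compensation code: the original WHILE loop, always entered, and the
    # only place where the early exit remains.
    while True:
        trace.append(("A", i))
        if i >= n:
            break
        trace.append(("B", i))
        i += 1
    return trace
```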
The following discusses how the unroll factor interacts
with the scalar replacement optimization. This is particularly
important because the form of this optimization determines
the unroll factor. Consider the loop shown on FIG. 9. Notice
that the value of A[I] stored in the inner loop at 367 is loaded
again two loop iterations later, when the same statement
loads A[I-2] with a value of I which is incremented by 2. The
idea behind scalar replacement, which is well known in the
prior art, is to save array values in temporary variables if
they are accessed again within the next few iterations.
Thus, the loop can be modified as is shown on FIG. 10.

Here, the array reference A[I-2] at 391 is replaced with T,
and it is followed by two instructions at 393 and 395 which
assign values to T and T1. Two scalar temporary variables
are necessary because the two most recent array
values must be saved. The value of A[I-1] would have been
stored in the previous iteration in T1 and that is going to be
used in the next iteration. We move T1 to T and T will be
accessed in the next iteration. Similarly T1, to which A[I] is
assigned, will be moved to T in the next iteration and used
2 iterations from now.

The instruction at 395 appears to make an indexed reference
to the array A and that suggests that an array access
must be made to get the number to put into T1, which would
be a high latency operation and would lose all that was
gained by the optimization. What actually happens is the
optimizer recognizes that the value of A[I] is stored two
instructions earlier at 391, and that A[I] is resident in a
register which can be stored into T1 without accessing A[I].
As T1 is likely to be assigned to a register, this operation is
a register to register instruction.

The foregoing illustrates how the various optimization
techniques are interrelated, allowing the loop unrolling
optimizer to generate code which is clearly not optimal in itself
but is optimized by other optimizers in the compiler.
Initialization code is inserted at 388 to define initial values of
T and T1.
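Scalar replacement of this kind can be sketched as below. FIGS. 9 and 10 are not reproduced here, so the recurrence a[i] = a[i-2] + b[i] is a hypothetical example with the same dependence distance of two; b is assumed to be at least as long as a.

```python
def recurrence_reference(a, b):
    """Loop that reloads a[i-2] from the array on every iteration."""
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i - 2] + b[i]
    return a


def recurrence_scalar_replaced(a, b):
    """Same loop with a[i-2] and a[i-1] carried in scalars t and t1."""
    a = list(a)
    if len(a) < 2:
        return a
    t, t1 = a[0], a[1]               # initialization code (like 388)
    for i in range(2, len(a)):
        a[i] = t + b[i]              # a[i-2] replaced by t (like 391)
        t = t1                       # copy (like 393)
        t1 = a[i]                    # copy of the value just stored (like 395)
    return a
```

The two copies at the bottom of the loop are the overhead that a suitable unroll factor can later remove.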

FIG. 11 shows how such a loop can be unrolled according
to the prior art if an unroll factor of three had been chosen.
The prior art does not specify the selection of an unroll
factor of three here, but a factor of three, or a multiple of
three, is optimal because it allows other parts of the optimizer
to generate the code shown on FIG. 12. The selection of an
unroll factor here of three or a multiple of three is important
because three values of the array must be kept, A[I], A[I-1]
and A[I-2], if the array references are to be avoided.

Three variables T, T1 and T2 are used, making it possible
for other well known code optimization techniques to generate
code, such as that shown on FIG. 12. This eliminates
shuttling the temporary data between T and T1. This is an
example of where the nature of the code in the loop and its
effect on scalar replacement forces a particular unroll factor.
In general, one lists the various indexes in the loops and sorts
them to notice the maximum distance between them.

In the above case the distance between the index I and
I-2 is 2. Adding one to this value computes a primary unroll
factor, which is three in this case. This is an acceptable
unroll factor. However, if it turned out to be very small, one
might want to multiply it by a constant to get a larger unroll
factor. Alternatively, if the primary unroll factor was very
large, one might want to divide it by the loop increment if
it was a number other than one. The reasons for selecting a
particular unroll factor are discussed below.
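The primary unroll factor computation and the resulting copy elimination can be sketched as follows. The recurrence a[i] = a[i-2] + b[i] and all names are hypothetical, since FIGS. 11 and 12 are not reproduced here; the point is only that with three rotating temporaries no register-to-register copies remain in the unrolled body.

```python
def primary_unroll_factor(index_offsets):
    """Maximum distance between the array index offsets, plus one."""
    return max(index_offsets) - min(index_offsets) + 1


def recurrence_reference(a, b):
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i - 2] + b[i]
    return a


def recurrence_unrolled_by_three(a, b):
    """Unrolled by 3 with temporaries t, t1, t2 rotating through the roles."""
    a = list(a)
    n = len(a)
    if n < 2:
        return a
    t, t1 = a[0], a[1]               # t holds a[i-2], t1 holds a[i-1]
    i = 2
    while n - i >= 3:
        t2 = t + b[i]                # new a[i]
        a[i] = t2
        t = t1 + b[i + 1]            # new a[i+1], reusing the dead t
        a[i + 1] = t
        t1 = t2 + b[i + 2]           # new a[i+2], reusing the dead t1
        a[i + 2] = t1
        i += 3                       # t and t1 again hold a[i-2] and a[i-1]
    while i < n:                     # compensation loop, copies still needed
        a[i] = t + b[i]
        t, t1 = t1, a[i]
        i += 1
    return a
```

For the offsets 0 and -2 used in this example, primary_unroll_factor([0, -2]) gives 3.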

Determining the unroll factor.

Classically the prior art uses a standard unroll factor for 
all loops. Typically the number used is four. In the invention, 
the unroll factor is calculated for each loop depending on 
various factors. At one extreme some loops are not unrolled 
at all, and other loops are unrolled eight or even more times.
The disadvantages of picking too large an unroll factor are: 

1. The loop instructions may no longer all fit into the instruction
cache, leading to a lot of I-cache misses.

2. The higher the unroll factor, the higher the memory
stride of memory instructions across iterations. If the
memory stride exceeds the cache line size, the effectiveness
of prefetch decreases.

3. The resulting code is longer. Usually an upper bound
must be chosen; an unroll count of 1000 is not likely to
be a good idea since the compile time can go up
significantly. Also, excessive unrolling can adversely
affect other optimizations which have bounds on the
number of transformations they can make.

On the other hand, if a small unroll factor is chosen the
following problems can occur:

1. Much more time is spent executing the high latency
branch instruction which closes the loop.

2. The short inner loop provides many fewer opportunities
for optimization than longer inner loops. Where the
inner loop has high latency instructions, the compiler
can often have them execute in parallel with low
latency instructions. This may not be possible in
very short loops.

One must keep in mind that the compiler is compiling
loops that range from single instruction inner loops to loops
that have scores or even hundreds of instructions, and so the
compiler must balance these considerations to
achieve good unroll factors. To determine the unroll factor,
the compiler considers the following in decreasing order of
importance:

1. There is a maximum value of the unroll factor (which 
in the preferred embodiment of the invention is eight). 

2. The number of instructions in the unrolled loop must 
not exceed a specific limit. This provides another upper 
bound to the unroll factor. 

3. If there are references to previous indexed contents of
the array such as was shown in FIGS. 10 through 12, an
unroll factor suggested by this analysis (or a multiple of
it) should be used.

4. If prefetch instructions are being generated (this is 
known based on a user defined flag), then try to pick an 
unroll factor that keeps the value of the strides of array 
references within the loop below the cache line size. 

5. If the trip count is a constant known at compile time,
then an unroll factor that eliminates the need for a
compensation code loop should be selected. Typically
this would be an unroll factor of 2, 4 or 8, although
other numbers such as 3 or 5 might be possible.

6. If there is profile information, use that. If the profile
information says that the loop iterates on average
k times, then if k is smaller than the maximum value of the
unroll factor as dictated by the previous steps, use k;
otherwise use the maximum value of the unroll factor.

7. If there are high latency operations within the loop such 
as divide and square root operations, use an unroll 
factor that will enhance the maximum overlap of these 
instructions. For instance, if the architecture has two 
divide units and the loop has a single divide instruction, 
the loop should be unrolled an even number of times so 
that both the divide units can be kept simultaneously 
busy. 
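One possible reading of this priority list is sketched below; it is not the algorithm of the preferred embodiment. The LoopInfo fields, the instruction limit of 256 and the 64-byte cache line are hypothetical values chosen for illustration, only the maximum of eight comes from the text, and item 7 (functional units) is left out. Lower-priority considerations are applied only when they do not eliminate every remaining candidate.

```python
from dataclasses import dataclass
from typing import Optional

MAX_UNROLL = 8                       # item 1: value given in the text
MAX_UNROLLED_INSTRUCTIONS = 256      # item 2: hypothetical limit
CACHE_LINE_BYTES = 64                # item 4: hypothetical line size


@dataclass
class LoopInfo:
    body_instructions: int                      # must be at least 1
    recurrence_factor: Optional[int] = None     # item 3, e.g. 3 for A[I-2]
    base_stride_bytes: Optional[int] = None     # item 4, set if prefetching
    constant_trip_count: Optional[int] = None   # item 5
    profiled_trip_count: Optional[int] = None   # item 6


def _narrow(candidates, keep):
    """Apply a preference, but never discard every candidate."""
    kept = [f for f in candidates if keep(f)]
    return kept or candidates


def choose_unroll_factor(loop):
    # Items 1 and 2 are hard upper bounds.
    limit = min(MAX_UNROLL,
                max(1, MAX_UNROLLED_INSTRUCTIONS // loop.body_instructions))
    candidates = list(range(1, limit + 1))
    if loop.recurrence_factor:
        candidates = _narrow(candidates,
                             lambda f: f % loop.recurrence_factor == 0)
    if loop.base_stride_bytes:
        candidates = _narrow(candidates,
                             lambda f: f * loop.base_stride_bytes < CACHE_LINE_BYTES)
    if loop.constant_trip_count:
        candidates = _narrow(candidates,
                             lambda f: loop.constant_trip_count % f == 0)
    if loop.profiled_trip_count:
        candidates = _narrow(candidates,
                             lambda f: f <= loop.profiled_trip_count)
    return max(candidates)
```

Under these assumptions, a three-instruction loop with a recurrence factor of three would be unrolled six times, the largest multiple of three not exceeding the maximum of eight.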

The algorithm that computes the unroll factor tries to 
compute an optimal and acceptable unroll factor. The cost of 
a nonoptimal unroll factor is slower run time code. As 
discussed above, the algorithm is sensitive to profile data, 
number of instructions in the loop, architecture features like 
functional units and cache line size, interactions with other 
optimizations and constant trip counts. 

Attention is directed to FIG. 13, which shows how the
optimization algorithms presented here are implemented. At
201 the unroll factor is determined as described above. Next,
at 203, a check is made for the special case where the trip
count is known at compile time and is a multiple of the
unroll factor. In this case, the unrolled loop code is generated
from the original loop code, any middle exits are removed
leaving only the final exit, and the unrolled code is output at
242. Because this is an unrolled loop which needs no
compensation code, none is output. The other exit occurs at
203 where the trip count is not known at compile time, or the
trip count and unroll factor are such that compensation code
must be generated. Control goes to 205 where the unrolled
code is generated. For non-WHILE loops, the middle exits
are removed leaving only the final exit. For a WHILE loop,
the final exit is moved to the end of the loop (226) instead
of the middle and all other exits are removed. At this time a
determination (at 229) is made using the unroll factor to
determine if the compensation code should be output before
or after the unrolled loop code. If it should be after, control
goes to 244; otherwise control goes to 246. All three of these
control paths then meet at 999, terminating the unrolling
optimization.
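The decision flow just described can be summarized in a short sketch. The function name and the returned strings are hypothetical; the placement rule (in front for a power-of-two unroll factor, afterwards otherwise) follows the preference stated earlier in the description.

```python
def plan_unrolling(unroll_factor, trip_count=None):
    """Decide which code to emit; trip_count is None if unknown at compile time."""
    # Check at 203: constant trip count that is a multiple of the unroll factor.
    if trip_count is not None and trip_count % unroll_factor == 0:
        return "unrolled loop only"                    # output at 242
    # Determination at 229: placement depends on the unroll factor.
    is_power_of_two = unroll_factor > 0 and (unroll_factor & (unroll_factor - 1)) == 0
    if is_power_of_two:
        return "compensation before unrolled loop"
    return "compensation after unrolled loop"
```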

FIG. 14 is a block schematic diagram of a compiler for a 
programmable machine in accordance with the invention. 
The compiler of FIG. 2b shows a loop unrolling module 30.
The preferred embodiment of the invention provides a loop
unrolling module that is placed within the compiler as
shown in FIG. 2b. As shown in FIG. 14, the compiler
comprises an analysis module 1301 for analyzing and
unrolling loops within source applications. An optimizer
module 1302 determines an optimum unroll factor in
response to the analysis module. An unroll module 1303
generates an unrolled loop having said optimum unroll
factor, while a compensation module 1304 generates and
places any compensation code as required as a result of loop
unroll optimization.

Although the invention is described herein with reference 
to the preferred embodiment, one skilled in the art will 
readily appreciate that other applications may be substituted 
for those set forth herein without departing from the spirit 
and scope of the present invention. Accordingly, the inven- 
tion should only be limited by the claims included below. 

We claim: 

1. In a programmable machine, a compiler comprising:

an analysis module for analyzing and unrolling loops
within source applications;

an optimizer module for determining an optimum unroll
factor in response to said analysis module;

an unroll module for generating an unrolled loop having
said optimum unroll factor; and

a compensation module for generating and placing any
compensation code as required as a result of loop unroll
optimization,

wherein said compensation module performs a placement
calculation to determine whether to put said compensation
code before the unrolled loop or after the
unrolled loop.

2. The compiler of claim 1, wherein said compensation 
module ensures that said compensation code is executed at 
least once when said unrolled loop is executed. 

3. The compiler of claim 1, wherein said unroll factor is
responsive to a number of instructions in the loop.

4. The compiler of claim 1, wherein said unroll factor is
responsive to resource usage within the loop.

5. The compiler of claim 1, wherein said unroll factor is
responsive to prefetch distance of memory references within
the loop.

6. The compiler of claim 1, wherein said unroll factor is
responsive to profile information collected from previous
executions of compiled code.

7. The compiler of claim 1, wherein said unroll factor is
responsive to recurrence of memory references within the
loop.

8. The compiler of claim 1, wherein said unroll factor is
responsive to the number of instructions in the loop.

9. The compiler of claim 1, wherein a trip count for the
loop is known at compile time, and wherein said optimizer
module determines an unroll factor that executes the
compensation code zero times and suppresses the generation of
said compensation code.

10. The compiler of claim 1, wherein said placement
calculation is responsive to the unroll factor computed for
the loop.

11. The compiler of claim 1, wherein said loop is a loop
having an early exit.

12. The compiler of claim 1, wherein loop unrolling is
integrated with other low-level optimization phases.

13. The compiler of claim 12, wherein said other low-level
optimization phases include any of prefetch instruction
insertion, register reassociation, and instruction scheduling.

14. A method for unrolling loops, comprising the steps of:

determining an unroll factor;

generating an unrolled loop which always exits leaving a
remaining trip count of at least one;

generating compensation code;

determining whether the compensation code should be
placed before the unrolled loop or after the unrolled
loop; and

determining if the trip count is a power of two, and if it
is, placing the compensation code before the unrolled
loop.

15. The method of claim 14, wherein the loop to be
unrolled is a loop having an early exit.

16. The method of claim 15, wherein the loop having an
early exit after unrolling is transformed into a loop having an
exit at its end, and from which all intermediate exits have
been removed.

* * * * * 






